1 Logging into the Xanadu cluster

To access the cluster you need to login with ssh (secure shell):

ssh <user_name>@xanadu-submit-ext.cam.uchc.edu 

# you user name looks like this:
ssh ssun@xanadu-submit-ext.cam.uchc.edu 

2 Tranferring data to and from the cluster

To move files in between computer you can login with sftp use scp (secure copy):

2.1 sftp:

ftp stands for “File Transfer Protocol”, sftp is " Secure File Transfer Protocol". In other words, with sftp, a useraccount and password are required.


sftp  <your_username>@<host_name>

For the Xanadu cluster, there is a special partition for transferring data:


sftp ssun@transfer.cam.uchc.edu

1. You can then navigate to the directory where you want to take files from.
2. put and get can be used to move files from or to your computer, respectively

2.2 scp

scp can be used without logging in provided you know the exact location where your file of interest is or will go. We will primarily use sftp in this course.

# for copying TO the server
scp -r <path_to_directory> <your_username>@transfer.cam.uchc.edu:~/path/to/target/folder

You should be prompted for a password. If not, the transfer probably failed.

# for copying FROM the server
scp -r <your_username>@<host_name>:<target_directory> 

2.2.1 How do we know if transfer was complete?

There’s a program called md5 (mac) or md5sum (linux) that can help us with this. It returns a compact digital fingerprint for each file. Any change to the file will result in a different fingerprint.

on a mac:

md5 Macrophage_15min_rep1_PE1.fastq.gz

on Linux:

md5sum Macrophage_15min_rep1_PE1.fastq.gz

2.3 Exercise 1: Inspecting, retrieving, and checking files from server

1 Log onto the server using ssh
2 Navigate to your folder in /labs/Guertin/siyu
3 View the checksum string for the file Macrophage_15min_rep1_PE1.fastq.gz.
4 Logout and return to your home directory or open a new terminal window (command-t)
5 Transfer the file to your computer using sftp
6 Confirm that the transfer was complete

3 Running interactive sessions and batch scripts on Xanadu

Xanadu uses the SLURM (Simple Linux Utility for Resource Management) language to manage workloads.
Documentation:
https://slurm.schedmd.com/

3.1 Interactive session

3.1.1 Important: All actions that require CPU or memory should be run from an interactive session to avoid jamming the head node.

The head node is like a lobby where everyone enters a space, but actions should not be taken here. You should request a ‘room’ or interactive session in which to conduct your business.

To see if you are on the head node you can use the hostname command. If the hostname ends in .cam.uchc.edu, then you are on the head node. If the hostname is xanadu-*, then you are in an interactive session. If you are not working in an interactive session, then exit so that resources are available to other users.


hostname
#simplest way to start a session
srun --pty -p general --qos=general  bash
hostname
exit
hostname

#request 4 cores and 16 Gb of RAM (this is what I typically request)
srun --pty --qos=general -p general -c 4 --mem=16G bash

hostname
exit

To confirm that you are indeed in an interactive session, try the ‘hostname’ command and it should return xanadu-<n>

3.2 Batch scripts

Sbatch (Slurm Batch) scripts need to start with a header that contains important information for running and reporting the output of your script. sbatch-template.sh

#!/bin/bash

#SBATCH --job-name=hello.sh     # name for job
#SBATCH -N 1                    # number of nodes 
#SBATCH -n 1                    # number of jobs / tasks 
#SBATCH -c 1                    # number of cores 
#SBATCH -p general           # SLURM partition 
#SBATCH --qos=general        # SLURM Quality of service 
#SBATCH --mem=1G                # RAM (memory) requested 
#SBATCH --mail-type=ALL 
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

hostname # have this at beginning of every script for  troubleshooting

<your script>

here is a random script:

#! /bin/sh
list="1 0 3 6"

for x in $list
  do
# sleep for a few minutes so it is not immediately completed!  
    sleep 30s
    echo Welcome to Batch submissions
    echo $x
  done

echo $list

Here is one way to save it as a file:

touch welcome_batch.sh
nano welcome_batch.sh

# copy and paste the batch header and script into the nano text editor 
# Ctrl + O (save--also use the text prompt to rename if needed)
# Enter
# Ctrl + X (exit)

We can add this script to our batch script and rename it welcome_batch.sh.
To run the script we use:

sbatch welcome_batch.sh

To check the status there are several options:

squeue #check the status of the entire queue

squeue -u <usrID> # check the status of all jobs by a user

squeue -j <jobID>  # Check status of specific job

If something is wrong and you want to kill a job:

scancel <jobID>

You can also check the status of the server to see how busy it is with sinfo

sinfo --format="%10P  %6t %15O %15C %15F %10m %10e %15n %30E %10u" 

3.3 In class exercise 1: running and altering batch scripts

1. Copy the sbatch-template.sh script from above into your home directory.
2 Add the welcome_batch.sh script to your batch script, and adjust the other user input (email, jobID, etc) appropriately.
3. Save and run the script. Where did the output go?
4. create a sbatch-output folder in your home directory and alter the sbatch script such that the output and errors go to this folder.
5. intentionally create an error in your script (e.g. add double quotes but don’t close them), and run the script again. Now check the error file to see an example of error outputs.
6. Use the sbatch script to run another script from your scripts folder by calling it directly. i.e. instead of having the welcome-batch script commands in the sbatch file, create another script and run that with the batch script.

4 Server operations: Using module to temporarily load functions:

We’re going to use the cutadapt and bowtie2 amongst other software. These programs are on the server, however, we need to load them into our session for use. On our own machine we made sure the programs were in our $PATH. On the server we can use the module to load and have access to specific functions during your session without altering your $PATH. This provides flexibility and specificity in running software versions.
First, view which modules are available:

module avail  

We can check specifically for sratoolkit:

module avail cutadapt

Once you find the one you’re looking for:

module load <software>  

This is convenient because you can add module functions to your shell scripts to be sure that you are using the correct software versions.

AND
You don’t have to waste time adding lots of programs to your $PATH

It is flexible because it allows you to even switch between version of software on the fly. For instance:

# load software
module load <software_v.2>   

# do something
some commands, etc.
#unload software and load another version
module unload <software_v.2>
module load <software_v.1>

# do something else
more commands, etc.

4.1 Server operations: Running jobs in the background

  • One problem with working on the server is that if you start a command and it takes a while, you’re stuck.
  • To get around this you can run the operation in the background using the ‘&’ symbol.
  • This runs the program in the background, allowing you to continue using your command prompt while the program is executed.
  • To start a command in the background just add & to the end of the command.
  • You should get a process ID number: a unique identifier that allows you to track that operation.

To see what processes are running you can use several commands:

  • top: continuous monitoring of activity
  • ps or ‘process’: shows the activity under your user and only shows that activity at that instance (snapshot).
  • qstat: when you’re on the server.
  • squeue: (covered previously) can show jobs by, user, ID, or node.

4.2 Killing jobs

To stop runaway processes or ones that are taking too long in the background, you cannot use the normal Control-c. You can, however, scancel the process using the process ID number.

# when on the Xanadu server
scancel <process_ID>

You can also run jobs in the background on your own machine. In this case, you can use kill to cancel jobs.

kill <process_ID>

You can review other ways to view activities on the server and kill processes here: http://bioinformatics.uconn.edu/unix-basics/

5 Text editors

When you are writing a script, it is usually done locally (not on the server) and either copied into a server text editor such as emacs or nano or transfered with sftp. Certain text editors are designed for scripting and can recognize code. Importantly, they do not embed the document or fonts with hidden characteristics that can cause problems when running programs on you computer There are three features that you should look for in an editor:

1. language specific highlighting   2. line number display   3. version control

MAC USERS: Download BBedit here: http://www.barebones.com/products/bbedit/download.html?ref=tw_alert: http://www.barebones.com/products/bbedit/download.html?ref=tw_alert and then install it:

Open your text editor: gedit on Linux or BBEdit or textEdit on Mac.

Note: You can also use nano or other command line editors such as emacs or vim. I can advise in nano or emacs, but limited experience with vim.

A quick primer for nano is:

touch filename.sh

nano filename.sh

# write some code

ctrl-O save or use backspace to save as another name

make edits

ctrl-O to save

ctrl-X to exit

The first line is the Shebang line:

#! /bin/sh

or sometimes

#! /usr/bin/sh

to find out where your shell is type:

which sh

Let’s try a simple script:

ls -la
echo "Above are the directory listings for the following folder"
pwd

Create a new folder in your folder called ‘scripts’
Save your script as ls.sh
Go to the directory where ls.sh is and try to execute it:

./ls.sh

In order to run this script we need to give the proper permissions. To see permissions for files, type:

ls -la

The columns are:

1. permissions
2. owner
3. group
4. size
5. last modifiation
6. name  

In permissions: ‘d’=directory, ‘-’ = file, ‘r’ = read, ‘w’ = write, ‘x’ = execute.
Three sets of permissions are shown: User (owner), Group, Other users.

To give permission to execute type:

chmod +x ls.sh

Now use ls -la to view permissions and then try to execute.

6 Aligning ATAC-seq data

6.1 Obtain the reference genomes

First we need to download the three genomes we will align to: the chrM of mm39, hg38, and mm39. I always download from UCSC, Google “mm39 downloads ucsc” or the alike to get here: https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/

In order to get the chrM file you need to get the mm39.chromFa.tar.gz file and unzip/untar it. Then make a directory for your mm39 references, move chrM.fa into this directory, then delete everything we don’t need. Navigate to mm39 and download the full genome file. Then do the same for hg38—see if you can get this link from poking around UCSC.


wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.chromFa.tar.gz

tar -xzf mm39.chromFa.tar.gz

mkdir mm39

mv chrM.fa mm39 

rm chr*.fa
rm mm39.chromFa.tar.gz

cd mm39
ls
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz

cd ..

mkdir hg38

cd hg38

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz

6.2 Index reference genomes

Bowtie 2 is an efficient tool for aligning sequencing reads to long reference sequences. The first task is to build the genome index with bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer. This only has to be performed once per genome, but we are aligning to three separate genomes in this workflow.


module load bowtie2

gunzip hg38.fa.gz 

bowtie2-build hg38.fa hg38

cd ../mm39
bowtie2-build chrM.fa chrM_mm39
gunzip mm39.fa.gz
bowtie2-build mm39.fa mm39

6.3 Clip off adapter sequences

As Illumina reads become longer, the more likely it becomes that a read sequences into the adapter used to generate the library. A first step in most molecular genomics analyses is adapter clipping. There are many tools that accomplish this task, but we typically use cutadapt.

As an aside, there are more efficient ways to analyze data than writing a loop to go through the files in series. However, this is what I typically do in vignettes to make things simple. The quickest way I know of executing all these looped commands on all the files in parallel is to write a loop to go through all the files, pass the variable name $name as an argument to a script that generates a new sbatch script for each file set (PE1 and PE2), and execute the newly created sbatch script within the loop. I can show you an example if you’d like.

# in the home directory, create the ".bashrc" file
# open the file and write:
export PATH="$HOME/bin:$PATH"
export PATH=$PATH:~/excutive/file/directory
# each time when open a new Xanadu connection, activate the .bashrc before the commands:
source ~/.bashrc

Another aside: these are big files. If you want to make sure the scripts and commands are working, I recommend heading the first 400,000 lines of each PE file (zcat Macrophage_rep1_PE1.fastq.gz | head -4000000 > test_PE1.fastq), zipping, then running the commands only on these files so they finish quickly.

/home/FCAM/mguertin/software/fastq_pair

module load cutadapt

for fq in *PE1.fastq.gz
do
    name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
    echo $name
    gunzip $fq
    gunzip ${name}_PE2.fastq.gz
    cutadapt -j 2 -m 10 -O 1 -a CTGTCTCTTATACACATCT ${name}_PE1.fastq -o ${name}_PE1_no_adapt.fastq
    cutadapt -j 2 -m 10 -O 1 -a CTGTCTCTTATACACATCT ${name}_PE2.fastq -o ${name}_PE2_no_adapt.fastq
    gzip ${name}_PE1.fastq
    gzip ${name}_PE2.fastq
done

6.4 Re-pairing paired end files

cutadapt will discard reads with very short inserts, so the next step involves repairing the PE1 and PE2 files with fastq_pair. You can either download and install this, or you should be able to copy or execute the version I download and compiled onto Xanadu: /home/FCAM/mguertin/software/fastq_pair. I recommend putting it in a software directory and adding this directory to your $PATH—please let me know if you need a tutorial on the $PATH variable and your .bash_profile and .bashrc files. Next we will zip all the output files and move on to alignment with bowtie2.

for fq in *PE1.fastq.gz
do
    name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
    echo $name
    fastq_pair ${name}_PE1_no_adapt.fastq ${name}_PE2_no_adapt.fastq
    gzip ${name}_PE1_no_adapt.fastq
    gzip ${name}_PE2_no_adapt.fastq
    gzip ${name}_PE1_no_adapt.fastq.paired.fq
    gzip ${name}_PE2_no_adapt.fastq.paired.fq
done

6.5 Sequential alignments

The alignment steps include aligning to the mitochondrial chromosome, taking everything that does not align, aligning that to hg38, taking everything that does not align, then aligning that to mm39. There are some housekeeping commands included below, like pairing, zipping, and removing duplicate PE reads (i.e. each member of a pair align to the exact positions as another pair).

the -p $ncore option says to use 6 cores, but you are welcome to generate an sbatch script and ask for 30+ cores if you want the process to go quicker. A lot of the commands are hard coded, but it is better practice to initialize all the variable names in the header and pass the variables as arguments.

module load bowtie2
module load samtools
module load bedtools 

ncore=6

for fq in *PE1.fastq.gz
do
    name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
    echo $name
# align the chrM first, convert to BAM file and remove duplicates    
    bowtie2 -p $ncore --maxins 800 -x mm39/chrM_mm39 -1 ${name}_PE1_no_adapt.fastq.paired.fq.gz -2 ${name}_PE2_no_adapt.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.bam
    samtools index $name.bam
# extract all reads that do not align and generate FASTQs   
    samtools view -b $name.bam '*' | samtools sort -n - | bamToFastq -i - -fq ${name}_PE1.chrM_mm39.fastq -fq2 ${name}_PE2.chrM_mm39.fastq
    fastq_pair ${name}_PE1.chrM_mm39.fastq ${name}_PE2.chrM_mm39.fastq
    gzip ${name}_PE1.chrM_mm39.fastq.paired.fq
    gzip ${name}_PE2.chrM_mm39.fastq.paired.fq
    gzip ${name}_PE1.chrM_mm39.fastq
    gzip ${name}_PE2.chrM_mm39.fastq
# align to hg38 to get rid of the Jurkat reads    
    bowtie2 -p $ncore --maxins 800 -x hg38/hg38 -1 ${name}_PE1.chrM_mm39.fastq.paired.fq.gz -2 ${name}_PE2.chrM_mm39.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.hg38.bam
    samtools index $name.hg38.bam
# makes FASTQs out of everyhting that does not align    
    samtools view -b $name.hg38.bam '*' | samtools sort -n - | bamToFastq -i - -fq ${name}_PE1.hg38.fastq -fq2 ${name}_PE2.hg38.fastq
    fastq_pair ${name}_PE1.hg38.fastq ${name}_PE2.hg38.fastq
    gzip ${name}_PE1.hg38.fastq.paired.fq
    gzip ${name}_PE2.hg38.fastq.paired.fq
    gzip ${name}_PE1.hg38.fastq
    gzip ${name}_PE2.hg38.fastq
# Finally align to the mm39 genome    
    bowtie2 -p $ncore --maxins 800 -x mm39/mm39 -1 ${name}_PE1.hg38.fastq.paired.fq.gz -2 ${name}_PE2.hg38.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.mm39.bam
done

6.6 Swarming the cluster with individual jobs for each file/file set

An alternative and more efficient way to execute these commands, other than the serial loop above, is to automate the process of making a sbatch script and execute the sbatch scripts as resources on the cluster become available.

First we can generate and open a template file:

touch sbatch_fill_in_bowtie2_atac.sh
nano sbatch_fill_in_bowtie2_atac.sh

Next you can copy your template script with a string that you want replaced with a passed variable.


#!/bin/bash

#SBATCH --job-name=hello.sh     # name for job
#SBATCH -N 1                    # number of nodes 
#SBATCH -n 1                    # number of jobs / tasks 
#SBATCH -c 16                    # number of cores 
#SBATCH -p general           # SLURM partition 
#SBATCH --qos=general        # SLURM Quality of service 
#SBATCH --mem=16G                # RAM (memory) requested 
#SBATCH --mail-type=ALL 
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

hostname 

fq=XXXXXXX
ncore=16

module load bowtie2
module load samtools
module load bedtools 

name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
echo $name
# align the chrM first, convert to BAM file and remove duplicates    
bowtie2 -p $ncore --maxins 800 -x mm39/chrM_mm39 -1 ${name}_PE1_no_adapt.fastq.paired.fq.gz -2 ${name}_PE2_no_adapt.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.bam
samtools index $name.bam
# extract all reads that do not align and generate FASTQs   
samtools view -b $name.bam '*' | samtools sort -n - | bamToFastq -i - -fq ${name}_PE1.chrM_mm39.fastq -fq2 ${name}_PE2.chrM_mm39.fastq
fastq_pair ${name}_PE1.chrM_mm39.fastq ${name}_PE2.chrM_mm39.fastq
gzip ${name}_PE1.chrM_mm39.fastq.paired.fq
gzip ${name}_PE2.chrM_mm39.fastq.paired.fq
gzip ${name}_PE1.chrM_mm39.fastq
gzip ${name}_PE2.chrM_mm39.fastq
# align to hg38 to get rid of the Jurkat reads    
bowtie2 -p $ncore --maxins 800 -x hg38/hg38 -1 ${name}_PE1.chrM_mm39.fastq.paired.fq.gz -2 ${name}_PE2.chrM_mm39.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.hg38.bam
samtools index $name.hg38.bam
# makes FASTQs out of everyhting that does not align    
samtools view -b $name.hg38.bam '*' | samtools sort -n - | bamToFastq -i - -fq ${name}_PE1.hg38.fastq -fq2 ${name}_PE2.hg38.fastq
fastq_pair ${name}_PE1.hg38.fastq ${name}_PE2.hg38.fastq
gzip ${name}_PE1.hg38.fastq.paired.fq
gzip ${name}_PE2.hg38.fastq.paired.fq
gzip ${name}_PE1.hg38.fastq
gzip ${name}_PE2.hg38.fastq
# Finally align to the mm39 genome    
bowtie2 -p $ncore --maxins 800 -x mm39/mm39 -1 ${name}_PE1.hg38.fastq.paired.fq.gz -2 ${name}_PE2.hg38.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.mm39.bam

Now you can run an interactive loop that dynamically updates the file name and submits the job.

file=sbatch_fill_in_bowtie2_atac.sh

for i in *PE1.fastq.gz
do
    nm=$(echo $i | rev | cut -f 1 -d '/' | rev | cut -f 1 -d '.')
    fq=$(echo $i | rev | cut -f 1 -d '/' | rev)
    echo $nm
    echo $fq
    sed -e "s/XXXXXXX/${fq}/g" "$file" > ${nm}_sbatch_fill_in_bowtie2_atac.sh
    sbatch ${nm}_sbatch_fill_in_bowtie2_atac.sh
done

The few annoying things that can be improved is that the fastq.gz files are currently in the working directory because passing a variable with the / character included as the string into sed causes issues. These issues can most easily be resolved by changing the sbatch_fill_in_bowtie2_atac.sh script to have directories coded in the header as variables.

One side note: leave enough space in the home directory for this step, un-zipped files are large and can take up to 600G space. use command du -sh /your/directory/ to check the space usage.

7 ATAC-seq peak calling

The next code chunk calls peaks and generates a 200 base window around the peak summit for de novo motif analysis. In the case of error message occurring: xanadu OSError: [Errno 28] No space left on device, we can set the TMPDIR environment variable (Linux, UNIX) by making a temp directory as:

mkdir -p ${HOME}/temp

then export this new temp directory as the new location for the temporary directory.


module load macs3
module load bedtools 
# export temp dir
export TMPDIR="${HOME}/temp" 

wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.chrom.sizes
name=macrophage
atacFiles=$(ls *.mm39.bam)

# call peaks
macs3 callpeak -t $atacFiles -f BAMPE -g mm -n $name -B --trackline -q 0.01 --outdir ${name}_peaks 2>&1 | tee -a Macrophage_peaks_log.txt

# remove header (start at line 2 of the file) (one edition: use tail -n +2 instead of tail +2)
tail -n +2 ${name}_peaks/${name}_summits.bed > ${name}_summits2.bed

# make a 200bp window around the summit
slopBed -l 99 -r 100 -i ${name}_summits2.bed -g mm39.chrom.sizes > ${name}_summit_window.bed

# sort
sortBed -i ${name}_summit_window.bed > ${name}_summit_window_sorted.bed

# merge
mergeBed -i ${name}_summit_window_sorted.bed > ${name}_summit_window_merged.bed

8 Converting BAM to bigWIg for counting

On Xanadu; this requires installation of program seqOutBias

module load genometools/1.5.10 # required for running seqOutBias

for i in *.mm39.bam
do
    name=$(echo $i | awk -F".mm39.bam" '{print $1}')
    echo $name
    seqOutBias mm39.fa $i --no-scale --bw=${name}.bigWig --shift-counts --skip-bed
done

9 DESeq2 counting in R

There are many versions of R available on Xanadu, but version R/4.1.2 has several of the packages we need for class preinstalled. After loading the module, typing R will take you from a command line shell that is interpreted by bash to a command line interpreted through the R language.

srun --pty --qos=general -p general -c 2 --mem=16G bash

module load R/4.1.2

R

I exclusively run R interactively and I usually run it locally, so you are welcome to take the files off the server at this point and install these libraries locally. Just make sure to back up the data if you work locally:

# load library that are already on Xanadu:
library(DESeq2)
library(lattice)
library(bigWig)

# functions for DESeq2
get.raw.counts.interval <- function(df, path.to.bigWig, file.prefix = 'M') {
    df = df[,1:3]
    vec.names = c()
    inten.df=data.frame(matrix(ncol = 0, nrow = nrow(df)))
    for (mod.bigWig in Sys.glob(file.path(path.to.bigWig,
                                          paste(file.prefix, "*.bigWig", sep ='')))) {
        factor.name = strsplit(strsplit(mod.bigWig, "/")[[1]][length(strsplit(mod.bigWig, "/")[[1]])], '\\.')[[1]][1]
        print(factor.name)
        vec.names = c(vec.names, factor.name)
        loaded.bw = load.bigWig(mod.bigWig)
        mod.inten = bed.region.bpQuery.bigWig(loaded.bw, df)
        inten.df = cbind(inten.df, mod.inten)
    }
    colnames(inten.df) = vec.names
    r.names = paste(df[,1], ':', df[,2], '-', df[,3], sep='')
    row.names(inten.df) = r.names
    return(inten.df)
}

run.deseq.list <- function(mat, untreatedreps = 3, treatedreps = 4) {
    sample.conditions = factor(c(rep("untreated", untreatedreps) , rep("treated", treatedreps)), levels=c("untreated","treated"))
    deseq.counts.table = DESeqDataSetFromMatrix(mat, DataFrame(sample.conditions), ~ sample.conditions);
    colData(deseq.counts.table)$condition<-factor(colData(deseq.counts.table)$sample.conditions, levels=c("untreated","treated"));
    dds = DESeq(deseq.counts.table);
    res = results(dds);
    res = res[order(res$padj),];
    return(res)
}


plot.ma.lattice.publishable <- function(ma.df, filename = 'file.name', adj = 0.001,
                                        title.main = "Differential Expression") {
    ma.df$group <- 'a'
    ma.df[!is.na(ma.df$padj) & !is.na(ma.df$log2FoldChange) &
          ma.df$padj < adj & ma.df$log2FoldChange < 0,]$group <- 'b'
    ma.df[!is.na(ma.df$padj) & !is.na(ma.df$log2FoldChange) &
          ma.df$padj < adj & ma.df$log2FoldChange > 0,]$group <- 'c'
    ma.df[ma.df$padj > 0.5 & !is.na(ma.df$padj) & !is.na(ma.df$log2FoldChange) &
          abs(ma.df$log2FoldChange) < 0.25,]$group <- 'd'
    pdf(paste("MA_plot_", filename, "_FDR_pval_", as.character(adj),".pdf", sep=''),
        width=3.83, height=3.83, useDingbats=FALSE)
    print(xyplot(ma.df$log2FoldChange ~ log(ma.df$baseMean, base=10), groups=ma.df$group,
                 col=c("grey85","#2290cf","#ce228e", "grey65"),
                 main=title.main, scales="free", aspect=1, pch=20, cex=0.3,
                 ylab=expression("log"[2]~"Differential ATAC"),
                 xlab=expression("log"[10]~"Mean of Normalized Counts"),
                 par.settings=list(par.xlab.text=list(cex=1.1,font=2),
                                   par.ylab.text=list(cex=1.1,font=2))))
    dev.off()
    return(ma.df)
}

process.deseq.df <- function(ma.df, filename = 'file.name', adj = 0.001, pval = NA) {
    decreased =  ma.df[ma.df$padj < adj & !is.na(ma.df$padj) & ma.df$log2FoldChange < 0,];
    increased =  ma.df[ma.df$padj < adj & !is.na(ma.df$padj) & ma.df$log2FoldChange > 0,];
    not.different =  ma.df[ma.df$pvalue > 0.2 & !is.na(ma.df$pvalue) & !is.na(ma.df$log2FoldChange) & abs(ma.df$log2FoldChange) < 0.2,];

    coor = rownames(decreased)
    coor.start = sapply(strsplit(sapply(strsplit(as.character(coor),':'), "[", 2), "-"), "[", 1);
    coor.end = sapply(strsplit(as.character(coor),'-'), "[", 2)
    coor.chr = sapply(strsplit(as.character(coor),':'), "[", 1)
    df.coor = cbind(coor.chr, coor.start, coor.end, as.character(decreased$baseMean), as.character(decreased$log2FoldChange), '+', as.character(decreased$lfcSE), as.character(decreased$pvalue), as.character(decreased$padj))
    write.table(df.coor, file = paste('decreased_', filename, '_', adj, '_FDR.bed', sep =''), sep = '\t', quote=F, row.names=F, col.names=F)
    
    coor = rownames(increased)
    coor.start = sapply(strsplit(sapply(strsplit(as.character(coor),':'), "[", 2), "-"), "[", 1);
    coor.end = sapply(strsplit(as.character(coor),'-'), "[", 2)
    coor.chr = sapply(strsplit(as.character(coor),':'), "[", 1)
    df.coor = cbind(coor.chr, coor.start, coor.end, as.character(increased$baseMean), as.character(increased$log2FoldChange), '+', as.character(increased$lfcSE), as.character(increased$pvalue), as.character(increased$padj))
    write.table(df.coor, file = paste('increased_', filename, '_', adj, '_FDR.bed', sep =''), sep = '\t', quote=F, row.names=F, col.names=F)

    coor = rownames(not.different)
    coor.start = sapply(strsplit(sapply(strsplit(as.character(coor),':'), "[", 2), "-"), "[", 1);
    coor.end = sapply(strsplit(as.character(coor),'-'), "[", 2)
    coor.chr = sapply(strsplit(as.character(coor),':'), "[", 1)
    df.coor = cbind(coor.chr, coor.start, coor.end, as.character(not.different$baseMean), as.character(not.different$log2FoldChange), '+', as.character(not.different$lfcSE), as.character(not.different$pvalue), as.character(not.different$padj))
    write.table(df.coor, file = paste('notDifferent_', filename, '_', adj, '_FDR.bed', sep =''), sep = '\t', quote=F, row.names=F, col.names=F)
}

#running processes
# define variables:
directory = "/full/path/to/the/bigwig/files"
MAC.file = read.table("macrophage_summit_window_merged.bed") 

#only count on well annotated chromosomes:
chr.keep = paste0("chr",c(1:19))
MAC.file = MAC.file[MAC.file[,1] %in% c(chr.keep, 'chrX', 'chrY'),]

# count reads afalling into your peaks
df.MAC = get.raw.counts.interval(MAC.file, directory, file.prefix = 'MAC')

# 
head(df.MAC)
save(df.MAC, file = "df.MAC.Rdata")

10 Exploration

10.1 experiment setting

Data from Dr. Kodimangalam S. Ravichandran’s lab group in WUSTL. In the study, they harvested the mouse macrophage and challenged the cell with or without human apoptotic cells to induce efferocytosis at the indicated time points for ATAC-seq experiments.

Experiment setting

10.2 PCA

# load library:
library(DESeq2)
library(DEGreport)
library(tibble)
library(lattice)
library(tidyr)
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')
#load the count table that contains gene counts across all conditions.
load("/directory/to/df.MAC.Rdata")
sample.conditions = c(sapply(strsplit(as.character(colnames(df.MAC)), '_'), '[', 2)[1:12], rep("t0",4))
sample.conditions = factor(sample.conditions, levels=c("t0","15min","30min","45min"))

deseq.counts.table = DESeqDataSetFromMatrix(df.MAC, as.data.frame(sample.conditions), ~ sample.conditions)
dds = DESeq(deseq.counts.table)

normalized.counts = counts(dds, normalized=TRUE)
save(normalized.counts,file='normalized.counts.atac.Rdata')
# generate PCA plot
rld = rlog(dds, blind=TRUE)
x = plotPCA(rld, intgroup="sample.conditions", returnData=TRUE)
plotPCAlattice(x, file = 'PCA_Macrophage_lattice_guertin.pdf')

PCA_Macrophage

10.3 LRT

#lrt
dds.lrt = DESeq(dds, test="LRT", reduced = ~ 1)
res.lrt = results(dds.lrt)

10.3.1 decide the p adjusted value (FDR) cut-off

library(dplyr)
library(tidyr)
#decide a padj cutoff
FDR<-as.data.frame(res.lrt$padj)
FDR$DAregions<-rownames(res.lrt)

# FDR_sort<-arrange(FDR, res.lrt$padj)
# data.perc=0.2
# FDR_sort[round(data.perc*nrow(FDR_sort)),]

pads<-seq(0, 210, 0.05)
FDR$log10padj<-log10(FDR$`res.lrt$padj`)
FDR$log10padjround<-round(log10(FDR$`res.lrt$padj`), digits=2)

FDR.modi<-na.omit(FDR) %>%count(log10padjround)

for (i in 2:nrow(FDR.modi)){
  FDR.modi$accum_n[1]=FDR.modi$n[1]
  FDR.modi$accum_n[i]=FDR.modi$n[i]+FDR.modi$accum_n[(i-1)]
}
plot(FDR.modi$log10padjround, FDR.modi$accum_n, xlim=c(-12,0), xaxt='n',
   xlab="log10padjround", ylab="accumulative number of DA regions", pch=19)
axis(side=1, at=seq(-100, 0, by=0.5))
abline(v=-1.3467, col="red", lty=2, lwd=3)

#the closer to 0, the higher the padj

P_adj_cutoff0.045

10.3.2 use the cut-off to extract data that we are interested of

padj.cutoff = 0.045 

siglrt.re = res.lrt[res.lrt$padj < padj.cutoff & !is.na(res.lrt$padj),]

dim(siglrt.re)                                    

rld_mat <- assay(rld)
cluster_rlog = rld_mat[rownames(siglrt.re),]
meta = as.data.frame(sample.conditions)
rownames(meta) = colnames(cluster_rlog)

save.image('220609_macrophage.Rdata') # year==22, month==06, day ==09
save(cluster_rlog, meta, sample.conditions, file = 'cluster_rlog_pval_0.045.Rdata')
family=c("Dynamic", "Nondynamic")
peaks=c(34056, 136659)
df=data.frame(family, peaks)
df <- df %>% 
  mutate( class= "ATAC peaks")
pdf('number of peaks.pdf', width=5,height=5)

print(ggplot(y, aes(x=class, y=peaks, fill=factor(family, levels=c("Dynamic", "Nondynamic")))) +  geom_col(width = .5) +
        geom_text(aes(label = peaks),
            position = position_stack(vjust = 0.5)) +
  labs(title = 'Number of ATAC Peaks', y = 'Number of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("coral","bisque"))+
    theme(text = element_text(size = 20)) 
)
dev.off()

number of peaks.pdf

10.3.3 ATAC clustering

running on Xanadu:

module load gcc R
library(DESeq2)
library(DEGreport)
library(tibble)
library(lattice)
load('cluster_rlog_pval_0.045.Rdata')
     
clusters.all.test.0.045 <- degPatterns(cluster_rlog, metadata = meta, minc = 100, time = "sample.conditions", col=NULL, eachStep = TRUE)
save(clusters.all.test.0.045, file = '170339_clusters.all.minc100.pval0.045.Rdata')

Note that degPatterns doesn’t calculate significant difference between groups, so the matrix used as input should be already filtered to contain only genes that are significantly different or the most interesting genes to study.

minc: integer minimum number of genes in a group that will be return.

#load data
load('170339_clusters.all.minc100.pval0.045.Rdata') 
#using plotiing_clusters_LRT.R to optimize plotting functions with a small data frame

plot.df = clusters.all.test.0.045$normalized 

plot.df$sample.conditions = as.character(plot.df$sample.conditions)
plot.df$sample.conditions[plot.df$sample.conditions == 't0'] = 0
plot.df$sample.conditions[plot.df$sample.conditions == '15min'] = 15
plot.df$sample.conditions[plot.df$sample.conditions == '30min'] = 30
plot.df$sample.conditions[plot.df$sample.conditions == '45min'] = 45

plot.df$sample.conditions = as.numeric(plot.df$sample.conditions)
plot.df = plot.df[order(plot.df$genes),]
plot.df = plot.df[order(plot.df$sample.conditions),]

plot.df$cluster = paste('cluster', as.character(plot.df$cluster), sep = '')

plot.df$chr = sapply(strsplit(plot.df$genes, '[.]'), '[', 1)
plot.df$start = sapply(strsplit(plot.df$genes, '[.]'), '[', 2)
plot.df$end = sapply(strsplit(plot.df$genes, '[.]'), '[', 3)

10.3.4 dynamic peaks

save(plot.df,file='plot.df.Rdata')

write.table(cbind(plot.df$chr, plot.df$start, plot.df$end), 
            file = 'dynamic_peaks.bed', quote = FALSE, 
            sep = '\t', col.names=FALSE, row.names=FALSE)

10.3.5 nondynamic peaks

#Generate nondynamic peaks set for FIMO.
not.different = rownames(res.lrt[res.lrt$padj > 0.5 & !is.na(res.lrt$padj) & !is.na(res.lrt$log2FoldChange) & abs(res.lrt$log2FoldChange) < 0.25,])
#not.different = rownames(res.lrt[res.lrt$padj > 0.1 & !is.na(res.lrt$padj) & !is.na(res.lrt$baseMean) & res.lrt$baseMean > 10,])
chr = sapply(strsplit(not.different, ':'), '[', 1)
x = sapply(strsplit(not.different, ':'), '[', 2)
start = sapply(strsplit(x, '-'), '[', 1)
end = sapply(strsplit(x, '-'), '[', 2)

curated.not.different = data.frame(chr,start,end)
write.table(curated.not.different,file='nondynamic_peaks.bed',sep='\t',col.names=F,row.names=F,quote=F)

10.3.6 all.other peaks

in xanadu:

module load gcc bedtools  R

#collect ATAC peaks that were not sorted into dynamic/nondynamic peaks
intersectBed -v -a macrophage_summit_window.bed   #Only report those entries in A that have _no overlaps_ with B.
        
                -b *peaks.bed > all_other_peaks.bed #first three column
all_other<-unique(read.table("all_other_peaks.bed"))
write.table(all_other[,1:3],file='allother_peaks.bed',sep='\t',col.names=F,row.names=F,quote=F)

10.3.7 plot

10.3.7.1 plot individual clusters

#just plot loess

loess.df = data.frame(matrix(ncol = 3, nrow = 0),  stringsAsFactors = FALSE)
colnames(loess.df) = c('ATAC','time','cluster')
for (i in unique(plot.df$cluster)){
    ATAC = loess.smooth(plot.df[plot.df$cluster == i,]$sample.conditions, plot.df[plot.df$cluster == i,]$value,  span = 1/2, degree = 1, family = c("gaussian"))$y
    time = loess.smooth(plot.df[plot.df$cluster == i,]$sample.conditions, plot.df[plot.df$cluster == i,]$value, span = 1/2, degree = 1, family = c("gaussian"))$x
    new = as.data.frame(cbind(ATAC, time, i),  stringsAsFactors = FALSE)
    colnames(new) = c('ATAC','time','cluster')
    new$ATAC = as.numeric(new$ATAC)
    new$time = as.numeric(new$time)
    loess.df = rbind(loess.df, new)
}
colors.clusters = rep('white', 7)
colors.clusters[c(4,6)] = 'blue' # transient decrease
colors.clusters[1] = 'cornflowerblue' #decrease-increase-decrease

colors.clusters[c(3,7)] = 'red' #transient increase
colors.clusters[5] = 'tomato3'  #gradual increase
colors.clusters[2] = 'lightsalmon' #increase-decrease-increase
pdf(paste('Clusters_individual_loess','.pdf', sep=''), width=4, height=4)

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
xyplot(ATAC ~  time, groups = cluster, data = loess.df, type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=1.0,relation = "free", rot = 45), y =list(cex=1.0, relation="free")),
       auto.key = T,
       aspect=1.0,
       between=list(y=0.5, x=0.5),
       ylab = list(label = 'cluster loess of ATAC signal', cex =1.0),
       xlab = list(label = 'Time (minutes)', cex =1.0),
       par.settings = list(superpose.symbol = list(pch = c(16),
                                                   col=colors.clusters, cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = colors.clusters, lwd=c(3),
                                                 lty = c(1))))

      )
key=list(space="right",
         lines=list(col=c("blue", "cornflowerblue", "red", "tomato3", "lightsalmon")), 
         lty=c(3,2), 
         lwd=6,
         text=list(c("cluster1", "cluster4,6", "cluster3,7", "cluster5", "cluster2"))
)


dev.off()

dynamic_cluster

10.3.7.2 plot all traces

#this has a bwplot, loess curve, points, and lines, which can be removed from the panel function to customize
# Local regression or local polynomial regression, also known as moving regression, is a generalization of the moving average and polynomial regression. Its most common methods, initially developed for scatterplot smoothing, are LOESS and LOWESS, both pronounced. 
df = data.frame(index=1:7,cluster=unique(plot.df.atac$cluster)[order(unique(plot.df.atac$cluster))])
df$cluster.num = as.integer(sapply(strsplit(df$cluster, 'cluster'), '[', 2))
df = df[order(df$cluster.num),]
df = df[reorder(df$cluster.num,c(4,6,1,5,3,7,2)),]

pdf('Clusters_minc100_stringent_pvalue.pdf', width=5, height=15) 

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
xyplot(value ~  sample.conditions | cluster, group = genes, data = plot.df.atac, type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=1.0,relation = "free", rot = 45), y =list(cex=1.0, relation="free")),
       aspect=1.0,
       layout = c(2,5),
       between=list(y=0.5, x=0.5),
       index.cond=list(rev(df$index)),
       skip = c(F,T,
                F,F,
                F,T,
                F,T,
                F,F),
       ylab = list(label = 'Normalized ATAC signal', cex =1.0),
       xlab = list(label = 'Time (minutes)', cex =1.0),
       par.settings = list(superpose.symbol = list(pch = c(16),
                                                   col=c('grey20'), cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = c('#99999980'), lwd=c(1),
                                                 lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 6, do.out = FALSE)
           panel.spline(x, y, col = 'blue', lwd =2.0, ...)
           
})

      )
dev.off()

dynamic_cluster

10.3.7.3 plot dendrogram

generate the dendrogram to confirm if we could merge some clusters

library(data.table)
x = as.data.table(plot.df)
plot.df.cluster = dcast(x, genes + cluster ~ sample.conditions, value.var="value")

avg.clusters = as.data.frame(matrix(nrow = 0, ncol = 4))
for (i in unique(plot.df.cluster$cluster)) {
    z = data.frame(matrix(colMeans(plot.df.cluster[plot.df.cluster$cluster == i,3:6]), 
                            ncol = 4, nrow = 1))
    rownames(z) = c(i)
    colnames(z) = as.character(colnames(plot.df.cluster)[3:6])
    avg.clusters = rbind(avg.clusters, z)
}


dd = dist(avg.clusters)
hc = hclust(dd, method = "complete")

pdf('dendrogram.pdf', width=8, height=5)
plot(hc, xlab = "Clusters", main = ' ', hang = -1)
abline(h = 1.5, lty = 2)
dev.off()

dendrogram

10.3.7.4 plot superclusters

cluster3 & 7

pdf(paste('Clusters_3_7','.pdf', sep=''), width=3.5, height=3.5, useDingbats =FALSE)

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
    xyplot(value ~  sample.conditions , group = genes, data =
                       plot.df[plot.df$cluster == 'cluster3' | 
    plot.df$cluster == 'cluster7',]
         , type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=1.0,relation = "free", rot = 45), y =list(cex=1.0, relation="free")),
       aspect=1.0,
       between=list(y=0.5, x=0.5),
       ylab = list(label = 'Normalized ATAC signal', cex =1.0),
       xlab = list(label = 'Time (minutes)', cex =1.0),
       par.settings = list(superpose.symbol = list(pch = c(16),
                                                   col=c('grey20'), cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = c('#99999980'), lwd=c(1),
                                                 lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 15, do.out = FALSE)
           panel.loess(x, y, ..., col = "blue", lwd =2.0,  span = 1/2, degree = 1, family = c("gaussian"))
           
})

      )
dev.off()

dynamic_cluster

cluster 4 & 6

pdf(paste('Clusters_4_6','.pdf', sep=''), width=3.5, height=3.5, useDingbats =FALSE)

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
    xyplot(value ~  sample.conditions , group = genes, data =
                       plot.df[plot.df$cluster == 'cluster4' | 
    plot.df$cluster == 'cluster6',]
         , type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=1.0,relation = "free", rot = 45), y =list(cex=1.0, relation="free")),
       aspect=1.0,
       between=list(y=0.5, x=0.5),
       ylab = list(label = 'Normalized ATAC signal', cex =1.0),
       xlab = list(label = 'Time (minutes)', cex =1.0),
       par.settings = list(superpose.symbol = list(pch = c(16),
                                                   col=c('grey20'), cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = c('#99999980'), lwd=c(1),
                                                 lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 15, do.out = FALSE)
           panel.loess(x, y, ..., col = "blue", lwd =2.0,  span = 1/2, degree = 1, family = c("gaussian"))
           
})

      )
dev.off()

#### plot all superclusters

# increase dynamics:
gradual.up = plot.df[plot.df$cluster == 'cluster5',]

up.down = plot.df[plot.df$cluster == 'cluster3' |
                    plot.df$cluster == 'cluster7' ,]

up.down.up = plot.df[plot.df$cluster == 'cluster2',]

#decrease dynamics:

down.up = plot.df[plot.df$cluster == 'cluster4'|
                    plot.df$cluster == 'cluster6',]

down.up.down = plot.df[plot.df$cluster == 'cluster1',]


gradual.up$supercluster = 'Gradual Increase' 
up.down$supercluster = 'Transient Increase'
up.down.up$supercluster = 'Oscillating Increase' 
down.up$supercluster = 'Transient Decrease' 
down.up.down$supercluster = 'Oscillating Decrease' 


plot.df.atac = rbind(gradual.up,
      up.down,
      up.down.up,
      down.up,
      down.up.down)


plot.df.atac = plot.df.atac[,2:16]
plot.df.atac$genes = paste0(plot.df.atac$chr,':',plot.df.atac$start,'-',plot.df.atac$end)
save(plot.df.atac,file='plot.df.atac.Rdata')
plot.df.atac$genes = as.factor(plot.df.atac$genes)
cat.colours = plot.df.atac[plot.df.atac$merge == 'one_groupt0',]
cat.colours$genes <- as.factor(cat.colours$genes)
cat.colours$supercluster <- as.factor(cat.colours$supercluster)

cat.colours$colour[cat.colours$supercluster == 'Oscillating Increase'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'Transient Increase'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'Gradual Increase'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'Oscillating Decrease'] <- '#0000FF08'
cat.colours$colour[cat.colours$supercluster == 'Transient Decrease'] <- '#0000FF08'

cat.colours$colour <- as.factor(cat.colours$colour)

cat.colours <- cat.colours[match(levels(plot.df.atac$genes), cat.colours$genes), ]
pdf('atac_superclusters.pdf', width=11, height=5)
trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
xyplot(value ~  sample.conditions | supercluster, group = genes,
       data = plot.df.atac, type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=0.8,relation = "free", rot = 45),
       y =list(cex=0.8, relation="free")),
       aspect=1.0, 
       between=list(y=0.5, x=0.5), 
       layout = c(5,1), 
       ylab = list(label = 'Normalized ATAC signal', cex=1.0), 
       xlab = list(label = 'Time (minutes)', cex =1.0), 
       par.settings = list(superpose.symbol = list(pch = c(16), 
                                                   col=c('grey20'), cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = as.character(cat.colours$colour), lwd=c(1),
                                                 lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 3,
                           do.out = F)
          panel.spline(x, y, col = 'black', lwd =3.0, ...) 
           
})

      )
dev.off()

down.up: cluster 4 & 6
down.up.down: cluster 1
gradual.up: cluster 5
up.down: cluster 3 & 7
up.down.up: cluster 2

11 de novo Motif analysis

11.1 MEME de novo motif analysis

for (i in unique(plot.df$cluster)) {
    print(i)
    write.table(plot.df[plot.df$cluster == i,
                        c('chr','start','end', 'value', 'cluster')][!duplicated(plot.df[plot.df$cluster == i,]$genes),],
                file = paste0('cluster_bed_',
                              gsub(" ", "", i, fixed = TRUE),'.bed'),
                quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
}

11.1.1 de novo motif identification

run MEME on each cluster.bed files to find de novo motifs
running on Xanadu:

wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.chrom.sizes
mkdir meme_motif_enrichment

for i in *cluster*bed
do
    name=$(echo $i | awk -F"_" '{print $NF}' | awk -F".bed" '{print $1}')
    echo $name
    echo '#SBATCH -o' $name'.meme.out' > temp.txt
    echo 'i='$i > temp2.txt
    cat meme_slurm_header_1.txt temp.txt \
        meme_slurm_header_2.txt temp2.txt \
        meme_slurm_header_3.txt > $name.meme.sh
    sbatch $name.meme.sh                                                                                                                                                           
    rm temp.txt
    rm temp2.txt
done

(1)meme_slurm_header_1.txt

#!/bin/bash

#SBATCH --job-name=meme1        # name for job
#SBATCH -N 1                    # number of nodes
#SBATCH -n 1                    # number of jobs / tasks
#SBATCH -c 48                    # number of cores
#SBATCH --mem=128G                # RAM (memory) requested

(2)meme_slurm_header_2.txt

#SBATCH -p general           # SLURM partition
#SBATCH --qos=general        # SLURM Quality of service
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

hostname # have this at beginning of every script for  troubleshooting

module purge

module load gcc bedtools/2.26.0 meme/5.4.1

(3)meme_slurm_header_3.txt

cd meme_motif_enrichment
name=$(echo $i | awk -F"_" '{print $NF}' | awk -F".bed" '{print $1}')
echo $name        
minSite=$((wc -l ${name}.fasta | awk '{print $1/2}')| awk '{print $1/50}')
head ${name}.fasta
 meme -p 59 -oc ${name}.meme_output -nmotifs 15 \
          -objfun classic -csites 20000 -evt 0.01 -searchsize 0 -minw 6 \
          -maxw 18 -revcomp -dna -markov_order 3 -minsites $minSite -maxsize 100000000  \
          ${name}.fasta

cd ..

Till this step, we have identified several de novo motifs within each cluster, the output file are named as clusterX.meme_output.

11.1.2 de novo motifs identification

pulls individual de novo motif from all the clusters and save each as its own file

module purge
module load gcc meme/5.4.1

for name in *output
do
  echo $name
  cd $name
  mkdir individual_memes
  cd individual_memes
  wget https://raw.githubusercontent.com/guertinlab/adipogenesis/master/motif_analysis/MEME_individual_from_db.py
  python2.7 MEME_individual_from_db.py -i ../meme.txt
  cd ..
done
cd ..
mkdir individual_memes
cd individual_memes
cp ../*meme_output/individual_memes/*_meme.txt ..

cd ..

cat individual_memes/*_meme.txt > all_denovo_motif.txt

tomtom against itself

mkdir tomtom_all_query_factors
cd tomtom_all_query_factors

for i in ../individual_memes/*meme.txt
do
    name=$(echo $i | awk -F"/" '{print $3}' | awk -F"/me" '{print $1}' | awk -F".meme" '{print $1}')
    echo $name
    tomtom -no-ssc -o $name.tomtom_output -verbosity 1  -incomplete-scores -min-overlap 1 -dist ed -evalue -thresh 0.0005 $i ../all_denovo_motif.txt
    cd $name.tomtom_output
    cut -f1,2,5 tomtom.tsv | tail -n +2 | sed '$d' | sed '$d' | sed '$d' | sed '$d' >> ../3_col_combined_motif_db_pre.txt
    cd ..
done

# tomtom all_denovo_motif.txt against all_denovo_motif.txt also works, generates the same 3_col_combined_motif_db_pre.txt file

grep -v '#' 3_col_combined_motif_db_pre.txt > 3_col_combined_motif_db.txt

#rm 3_col_combined_motif_db_pre.txt
cp 3_col_combined_motif_db.txt ..

11.1.3 define de novo motif families

A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.

PWMs are often derived from a set of aligned sequences that are thought to be functionally related and have become an important part of many software tools for computational motif discovery.

run this: https://raw.githubusercontent.com/guertinlab/adipogenesis/master/Pipeline_ATAC/4.2_visualize_pswms.R

#install.packages("igraph")
#install.packages("dichromat")
library(igraph)
library(dichromat)

#setwd('/your/directory/to/3_col_combined_motif_db.txt')

threecol=read.table("3_col_combined_motif_db.txt",header=F,stringsAsFactors = F,sep='\t')
colnames(threecol)=c('from','to','e_value')
threecol$weight=abs(log(threecol$e_value))

#create the graph variable
g=graph.data.frame(threecol,directed=F)
g=simplify(g)

cluster=clusters(g)

for(i in 1:length(groups(cluster))) {
write.table(groups(cluster)[i],file=paste0('PSWM_family_',i,'.txt'), 
    col.names = F, row.names = F, quote = F, sep = '\t')
}

l=layout.fruchterman.reingold(g)
l=layout.norm(l,-1,1,-1,1)

Visualize PSWM

colfunc<-colorRampPalette(c("red","yellow","springgreen","royalblue","purple","green", "salmon","orange","black", "blue" )) # 10 families
#pick a distinct color 
mycol = colfunc(length(groups(cluster)))

pdf(file='families.pdf',width=10,height=10)
plot(g,layout=l,rescale=F,vertex.label.cex=.5,xlim=range(l[,1]),  ylim=range(l[,2]),
edge.width=E(g)$weight/20,vertex.size=degree(g,mode='out')/5,
edge.curved=T,vertex.label=NA,vertex.color=mycol[cluster$membership],
margin=0,asp=0)
dev.off()

PSWM_visualization

Generate composite de novo motif of each family and composite de novo sequence logo

#!/bin/bash

wget https://raw.githubusercontent.com/guertinlab/adipogenesis/master/Pipeline_ATAC/misc_scripts/tomtom_output_to_composite.py
wget https://github.com/guertinlab/atac_rotation/blob/main/generate_composite_motif.R # or directly download from github

wget https://raw.githubusercontent.com/guertinlab/adipogenesis/master/Pipeline_ATAC/misc_scripts/meme_header.txt

Query TOMTOM to calculate index and values for PSWM

#!/bin/bash

mkdir composite_denovo_motifs
cd composite_denovo_motifs

module purge
module load gcc meme/5.4.1


for txt in ../PSWM_family_*.txt
do
    dir_name=$(echo $txt | awk -F'../' '{print $2}' | awk -F'.txt' '{print $1}')
    echo $dir_name
    mkdir $dir_name
    cd $dir_name
    
    cat ../$txt | while read line
    do
echo $line
cp ../../individual_memes/${line}_meme.txt_meme.txt $PWD
    done
    echo ''
    query_factor=`head -1 ../$txt`
    if [[ $(wc -l < ../$txt) -ge 2 ]]
    then    
cat ../$txt | { while read line
do
query_factor=$line
rm ref_factors_meme.txt
mv ${query_factor}_meme.txt_meme.txt ..
cat *_meme.txt > ref_factors_meme.txt
mv ../${query_factor}_meme.txt_meme.txt $PWD
tomtom -no-ssc -o ${query_factor}.tomtom_output -verbosity 1 -incomplete-scores -min-overlap 1 -dist ed -evalue -thresh 0.0005 ${query_factor}_meme.txt_meme.txt ref_factors_meme.txt                
if [[ $(wc -l < ${query_factor}.tomtom_output/tomtom.tsv) -ge $max_motif ]]
then
max_motif=$(wc -l < ${query_factor}.tomtom_output/tomtom.tsv)
final_query=$query_factor
fi
done
echo FINAL_QUERY IS $final_query
wc -l ${final_query}.tomtom_output/tomtom.tsv
cd ${final_query}.tomtom_output
python2.7 /home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/new_denovo_motif/tomtom_output_to_composite.py -i tomtom.xml
mv tomtom.xml_test_index_pswm.txt ../composite.values.txt
mv tomtom.xml_test_index_rc_offset.txt ../composite.index.txt
cd ../..
}
fi
    cd ..
done

Generate composite motif and sequence logo

#!/bin/bash

for family in PSWM*
do
    cd $family
    num=$(ls *txt | wc -l)

    if [[ $num -ge 2 ]]
    then
    
    #generate composite PSWM
    module load gcc R/4.1.2
    #dr=$(pwd)   
    Rscript ../../generate_composite_motif.R $family 
    cat /home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/new_denovo_motif/meme_header.txt ${family}_composite_PSWM.txt > ${family}_meme.txt    
    else
    line=`grep MOTIF *meme.txt`
    cp *meme.txt ${family}_meme.txt
    sed -i "s;${line};MOTIF   Composite;g" ${family}_meme.txt
    fi
    
    #generate logo
    module load gcc meme/5.4.1
    ceqlogo -i ${family}_meme.txt -m Composite -o ${family}.eps
    ceqlogo -i ${family}_meme.txt -m Composite -o ${family}.rc.eps -r
    cd ..
    
done

cd ..

Extracting de novo Motif sequences  

mkdir motifid_clusters
cd motifid_clusters

echo 'Extracting Motif Names'

for file in ../*tomtom_output/*.txt
do
    name=$(echo $file | awk -F"/" '{print $(NF-1)}' | awk -F".tomtom_output" '{print $1}')
    echo $name
    motifid=$name
    #echo $file
    linenum=$(awk 'END {print NR}' $file)
    #echo $linenum
    first=2
    i=$first
    if [[ $linenum != 5 ]]
    then
    mkdir $name
    while [[  $i -le $linenum ]]
    do
        head -$i $file | tail -1 > lastline
        mapid[$i]=$(awk 'END {print $2}' lastline | awk -F"(" '{print $1}')
        #echo mapid[$i]_${mapid[$i]}
        echo ${mapid[$i]} >> $name/motifidlist_$motifid.txt
        ((i = i + 1))
    done
    head -n $(( $(wc -l $name/motifidlist_$motifid.txt | awk '{print $1}') - 4 )) $name/motifidlist_$motifid.txt > $name/motifidlist_$motifid.final.txt
    rm $name/motifidlist_$motifid.txt
    mv $name/motifidlist_$motifid.final.txt $name/motifidlist_$motifid.txt
    fi
done

rm lastline
cat */motifidlist*txt > all_matched_motifs.txt
mv all_matched_motifs.txt ..
cd ..
sort all_matched_motifs.txt | uniq -u > all_matched_motifs_uniq.txt

11.1.4 Prepare known TF databases

Now we will use the composite de novo motif against the known database.

Generate homer_uniprobe_jaspar PSWM database

wget http://meme-suite.org/meme-software/Databases/motifs/motif_databases.12.19.tgz
tar -xzf motif_databases.12.19.tgz
#JASPAR
mv motif_databases/JASPAR/JASPAR2018_CORE_vertebrates_non-redundant.meme $PWD
#Uniprobe
mv motif_databases/MOUSE/uniprobe_mouse.meme $PWD

#Homer
#CAUTION: the HOMER_MEME_conversion.py was written for python2 so remember to specify python2.7 if you have python3 installed.
wget https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/HOMER_MEME_conversion.py
wget http://homer.ucsd.edu/homer/custom.motifs
python2.7 HOMER_MEME_conversion.py -i custom.motifs -o homer.motifs

#edit databases to work with tomtom
cp JASPAR2018_CORE_vertebrates_non-redundant.meme JASPAR_edited_meme.txt
grep MOTIF JASPAR_edited_meme.txt > motifs.txt
cat motifs.txt | while read motif
do
    name=$(echo $motif | awk -F" " '{print $NF}')
    temp=$(echo 'MOTIF' $name'_jaspar')
    echo $temp
    sed -i "s;${motif};${temp};g" JASPAR_edited_meme.txt
done
rm motifs.txt

#edit databases to work with tomtom
cp uniprobe_mouse.meme uniprobe_edited_meme.txt
grep MOTIF uniprobe_edited_meme.txt > motifs.txt
cat motifs.txt | while read motif
do
    name=$(echo $motif | awk -F" " '{print $NF}')
    temp=$(echo 'MOTIF' $name'_uniprobe')
    echo $temp
    sed -i "s;${motif};${temp};g" uniprobe_edited_meme.txt
done
#remove 'secondary' motifs
sed -i -e '4210,8365d;' uniprobe_edited_meme.txt

rm motifs.txt

#edit databases to work with tomtom
cp homer.motifs_meme.txt homer_edited_meme.txt
grep MOTIF homer_edited_meme.txt > motifs.txt
cat motifs.txt | while read motif
do
    name=$(echo $motif | awk -F" " '{print $NF}')
    name=$(echo $name | awk -F"/" '{print $1}')
    temp=$(echo 'MOTIF' $name'_homer')
    echo $temp
    sed -i "s;${motif};${temp};g" homer_edited_meme.txt
done
rm motifs.txt

#Collect all database motifs into one file
cat homer_edited_meme.txt \
    uniprobe_edited_meme.txt JASPAR_edited_meme.txt > homer_uniprobe_jaspar_edited.txt
mkdir individual_db_memes

#extract individual meme files from combined database
#CAUTION: you only need to run this once, even if you're working through the code again
cd individual_db_memes
wget https://raw.githubusercontent.com/guertinlab/adipogenesis/master/motif_analysis/MEME_individual_from_db.py
python2.7 MEME_individual_from_db.py -i ../homer_uniprobe_jaspar_edited.txt

11.1.5 TOMTOM the composite de novo motifs against the TF database

#!/bin/bash
module purge
module load gcc meme/5.4.1

mkdir composite_denovo_db

# tomtom to database

for family in composite_denovo_motifs/PSWM*
do
    name=$(echo $family | awk -F"composite_denovo_motifs/PSWM_" '{print $NF}')
    echo $name
    tomtom -no-ssc -o composite_denovo_db/$name.tomtom_output -verbosity 1 -min-overlap 5 -dist ed -evalue -thresh 0.05 composite_denovo_motifs/PSWM_${name}/*${name}_meme.txt homer_uniprobe_jaspar_edited.txt
    tomtom -no-ssc -o composite_denovo_db/$name.tomtom_output -verbosity 1 -min-overlap 5 -dist ed -text -evalue -thresh 0.05 composite_denovo_motifs/PSWM_${name}/*${name}_meme.txt homer_uniprobe_jaspar_edited.txt > composite_denovo_db/$name.tomtom_output/tomtom.txt
done

11.1.6 Identify the top hit

extract top hit motif names from tomtom output

mkdir motifid_top_PSWM
cd motifid_top_PSWM

echo 'Extracting top hitted Motif Names in each family'
for file in ../composite_denovo_db/*family*tomtom_output/*.txt
do
    name=$(echo $file | awk -F"/" '{print $(NF-1)}' | awk -F".tomtom_output" '{print $1}')
    echo $name
    motifid=$name
    #echo $file
    linenum=$(awk 'END {print NR}' $file) #number of lines in the txt file
    #echo $linenum 
    first=2
    i=$first
      if [[ $linenum != 5 ]]
      then
      
      sed '1d; /^$/d; /#/d' $file > ../composite_denovo_db/$name.tomtom_output/tomtom_comp.txt
      fi
done

for file in ../composite_denovo_db/*family*tomtom_output/tomtom_comp.txt
do
      name=$(echo $file | awk -F"/" '{print $(NF-1)}' | awk -F".tomtom_output" '{print $1}')
      echo $name
      awk 'NR == 1 || $4 < min {line = $0; min = $4}END{print line}' $file | awk '{print $2}' >> motifidlist_$name.txt
    
done


cat *.txt> top_database_PSWM_motifs.txt
cp top_database_PSWM_motifs.txt ..


sort top_database_PSWM_motifs.txt | uniq -u > top_database_PSWM_motifs_uniq.txt
cp top_database_PSWM_motifs_uniq.txt ..
cd ..

extract individual meme files from combined database

#!/bin/bash

#tomtom all query factors against all others
mkdir top_database_PSWM
cd top_database_PSWM


cat ../top_database_PSWM_motifs_uniq.txt | while read factor
do
    echo $factor
    cp ../individual_db_memes/${factor}_meme.txt $PWD
done

cat *meme.txt > ../all_query_factors_meme.txt
# make motif logos for the top hit
module load gcc meme/5.4.1      
for motif in *.txt
do
    name=$(echo $motif | awk -F"_meme.txt" '{print $1}' )
    echo $name

    ceqlogo -i ${motif} -m $name -o ${name}.eps
    ceqlogo -i ${motif} -m $name -o ${name}.rc.eps -r
    
done

11.1.7 top motif hit seqlogos

seqlogo

11.2 Genomic FIMO

#!/bin/bash

#SBATCH --job-name=fimo_composites_top     # name for job
#SBATCH -N 1                    # number of nodes 
#SBATCH -n 1                    # number of jobs / tasks 
#SBATCH -c 4                    # number of cores 
#SBATCH -p general           # SLURM partition 
#SBATCH --qos=general        # SLURM Quality of service 
#SBATCH --mem=200G                # RAM (memory) requested 
#SBATCH --mail-type=ALL 
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

module load gcc meme/5.4.1
# export temp dir
export TMPDIR="${HOME}/temp" 

mkdir fimo_top_database_PSWM
cd fimo_top_database_PSWM

for i in ../top_database_PSWM/*_meme.txt
do
    name=$(echo $i | awk -F"../top_database_PSWM/" '{print $NF}' | awk -F"_meme.txt" '{print $1}')
    echo $name
    
    #run FIMO
    
    fimo --thresh 0.001 --text $i \
        /home/FCAM/ssun/motifanalysis/mm39.fa \
        > ${name}_fimo.txt

    #this takes top 2M
    score=$(tail -n +2 ${name}_fimo.txt | sort -nrk6,6 | awk 'FNR == 2000000 {print $6}')
    echo $score
    tail -n +2 ${name}_fimo.txt | awk -v sco="$score" '{ if ($6 >= sco) { print } }' | \
        awk '{OFS="\t";} {print $2,$3,$4,$7,$6,$5,$8}' > ${name}_2M.txt

    #this was to get the order of conformity to consensus.
    tomtom -no-ssc -oc ${name}_ranks.tomtom_output -verbosity 1 -incomplete-scores -min-overlap 1 \
        -dist ed -evalue -thresh 0.05 ../all_query_factors_meme.txt ${name}_fimo.txt
done
#!/bin/bash
module load gcc bedtools
for i in *_2M.txt
do
    name=$(echo $i | awk -F"/" '{print $NF}' | awk -F"_2M.txt" '{print $1}')
    echo $name
    intersectBed -loj -a ../dynamic_peaks.bed -b $i > ${name}_fimo.bed
    intersectBed -loj -a ../nondynamic_peaks.bed -b $i > ${name}_fimo_nondyn.bed
    intersectBed -loj -a ../allother_peaks.bed -b $i > ${name}_fimo_allother.bed
    cat $i | cut -f1-3,5 | sort -k1,1 -k2,2n > ${name}_2M.bed 
done

12 Generating comprehensive ATAC dataframe

Non-dynamic peaks: padj > 0.5 & log2FoldChange < 0.25
N=49111
Dynamic peaks: padj <= 0.045
N=34056

#!/bin/bash
mkdir main_figure_beds
cd fimo_top_database_PSWM
cp *_fimo.bed ../main_figure_beds
cp *_fimo_nondyn.bed ../main_figure_beds
cd main_figure_beds
module load gcc R/4.1.2

R
library(ggplot2)
library(lattice)
library(pheatmap)
library(fabricatr)
library(ggseqlogo)

#load('plot.df.atac.Rdata')

#generate 'fimo.scores.atac.Rdata' object
res = unique(read.table('dynamic_peaks.bed'))
colnames(res) = c('chr', 'start', 'end')
res$start = as.numeric(as.character(res$start))
res$end = as.numeric(as.character(res$end))
rownames(res) = paste0(res[,1], ':', res[,2], '-', res[,3])

#main figure families

for(bed.file in Sys.glob(file.path('*_fimo.bed'))) {
    
    factor.name = strsplit(bed.file, "/")[[1]]
    factor.name = strsplit(factor.name[length(factor.name)],
                           '_fimo.bed')[[1]][1]
    print(factor.name)
    x = read.table(bed.file, stringsAsFactors=FALSE)
    x = x[x[,6] != -1,]   
    y = aggregate(as.numeric(V8)~V1+V2+V3, data=x, FUN=sum)
    colnames(y) = c('chr', 'start', 'end', factor.name)
    rownames(y) = paste0(y[,1], ':', y[,2], '-', y[,3])
    
    func <- function(peak) {
        if(peak %in% rownames(y)) {
            return(y[rownames(y) == peak,ncol(y)])
        } else {
            return(NA)
        }
    }
    
    res[,ncol(res)+1] = sapply(rownames(res),func)
    colnames(res)[ncol(res)] = factor.name

}

res = res[,-c(1:3)]

fimo.scores.atac = res
save(fimo.scores.atac, file = 'fimo.scores.atac.Rdata')

# save as bed files
# for(i in 1:ncol(fimo.scores.atac)) {
    # temp = fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]
    # chr = sapply(strsplit(rownames(temp), ':'), '[', 1)
    # x = sapply(strsplit(rownames(temp), ':'), '[', 2)
    # start = sapply(strsplit(x, '-'), '[', 1)
    # end = sapply(strsplit(x, '-'), '[', 2)
    # bed = data.frame(chr, start, end, score = temp[,i])
    # write.table(bed, file = paste0(colnames(fimo.scores.atac)[i],'_instances.bed'),
    #    row.names=F,col.names=F,quote=F,sep='\t')
# }

non-dynamic peaks as baseline

# generate fimo.scores.nondyn.atac.Rdata
res_c = unique(read.table('nondynamic_peaks.bed'))
colnames(res_c) = c('chr', 'start', 'end')
res_c$start = as.numeric(as.character(res_c$start))
res_c$end = as.numeric(as.character(res_c$end))
rownames(res_c) = paste0(res_c[,1], ':', res_c[,2], '-', res_c[,3])

#main figure families

for(bed.file in Sys.glob(file.path('*_fimo_nondyn.bed'))) {
    
    factor.name = strsplit(bed.file, "/")[[1]]
    factor.name = strsplit(factor.name[length(factor.name)],
                           '_fimo_nondyn.bed')[[1]][1]
    print(factor.name)
    x = read.table(bed.file, stringsAsFactors=FALSE)
    x = x[x[,6] != -1,]   
    y = aggregate(as.numeric(V8)~V1+V2+V3, data=x, FUN=sum)
    colnames(y) = c('chr', 'start', 'end', factor.name)
    rownames(y) = paste0(y[,1], ':', y[,2], '-', y[,3])
    
    func <- function(peak) {
        if(peak %in% rownames(y)) {
            return(y[rownames(y) == peak,ncol(y)])
        } else {
            return(NA)
        }
    }
    
    res_c[,ncol(res_c)+1] = sapply(rownames(res_c),func)
    colnames(res_c)[ncol(res_c)] = factor.name

}

res_c = res_c[,-c(1:3)]

fimo.scores.nondyn.atac = res_c
save(fimo.scores.nondyn.atac, file = 'fimo.scores.nondyn.atac.Rdata')

all other peaks

# generate fimo.scores.all.other.atac.Rdata
res_c = unique(read.table('allother_peaks.bed'))
colnames(res_c) = c('chr', 'start', 'end')
res_c$start = as.numeric(as.character(res_c$start))
res_c$end = as.numeric(as.character(res_c$end))
rownames(res_c) = paste0(res_c[,1], ':', res_c[,2], '-', res_c[,3])

#main figure families

for(bed.file in Sys.glob(file.path('*_fimo_allother.bed'))) {
    
    factor.name = strsplit(bed.file, "/")[[1]]
    factor.name = strsplit(factor.name[length(factor.name)],
                           '_fimo_allother.bed')[[1]][1]
    print(factor.name)
    x = read.table(bed.file, stringsAsFactors=FALSE)
    x = x[x[,6] != -1,]   
    y = aggregate(as.numeric(V8)~V1+V2+V3, data=x, FUN=sum)
    colnames(y) = c('chr', 'start', 'end', factor.name)
    rownames(y) = paste0(y[,1], ':', y[,2], '-', y[,3])
    
    func <- function(peak) {
        if(peak %in% rownames(y)) {
            return(y[rownames(y) == peak,ncol(y)])
        } else {
            return(NA)
        }
    }
    
    res_c[,ncol(res_c)+1] = sapply(rownames(res_c),func)
    colnames(res_c)[ncol(res_c)] = factor.name

}
res_c = res_c[,-c(1:3)]

fimo.scores.all.other.atac = res_c
save(fimo.scores.all.other.atac, file = 'fimo.scores.all.other.atac.Rdata')

12.0.1 add status info

in R

load("fimo.scores.atac.Rdata") # contains increase or decreased peaks
load("fimo.scores.nondyn.atac.Rdata") # contains unchanged peaks
load('fimo.scores.all.other.atac.Rdata') # contains all other peaks
load('plot.df.atac.Rdata') 

plot.df = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    temp = plot.df.atac[plot.df.atac$genes %in% rownames(fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]),]
    temp$family = colnames(fimo.scores.atac)[i]
    plot.df = rbind(plot.df,temp)
}

plot.df$genes = as.factor(plot.df$genes)
no.fam = rownames(fimo.scores.atac[which(rowSums(is.na(fimo.scores.atac)) == ncol(fimo.scores.atac)),])
temp = plot.df.atac[plot.df.atac$genes %in% no.fam,]
temp$family = 'No Family'
plot.df = rbind(plot.df,temp)

plot.df$status = 'Increase'
plot.df[plot.df$supercluster == 'Transient Decrease' | plot.df$supercluster == 'Oscillating Decrease',]$status = 'Decrease'
df1<-fimo.scores.nondyn.atac
df1$genes<-rownames(df1)
df1$chr = sapply(strsplit(df1$genes, ':|-'), '[', 1)
df1$start = sapply(strsplit(df1$genes, ':|-'), '[', 2)
df1$end = sapply(strsplit(df1$genes, ':|-'), '[', 3)
df1$status<-"unchanged"

df2<-fimo.scores.atac
df2$genes<-rownames(df2)
df2$chr = sapply(strsplit(df2$genes, ':|-'), '[', 1)
df2$start = sapply(strsplit(df2$genes, ':|-'), '[', 2)
df2$end = sapply(strsplit(df2$genes, ':|-'), '[', 3)
label<-unique(plot.df[,c(16,18)])
df2<-merge(df2,label,by="genes")


df3<-fimo.scores.all.other.atac
df3$genes<-rownames(df3)
df3$chr = sapply(strsplit(df3$genes, ':|-'), '[', 1)
df3$start = sapply(strsplit(df3$genes, ':|-'), '[', 2)
df3$end = sapply(strsplit(df3$genes, ':|-'), '[', 3)
df3$status<-"all.other"

fimo_scores_allpeaks<-rbind(df1,df2,df3)
save(fimo_scores_allpeaks, file = 'fimo.scores.allpeaks.atac.Rdata')

12.0.2 add counts info

load('fimo.scores.allpeaks.atac.Rdata')
df = data.frame(fimo_scores_allpeaks)
# add counts
load('normalized.counts.atac.Rdata')
zero.min = c()
fifteen.min = c()
thirty.min = c()
fourtyfive.min = c()

genes = c()

print('Getting ATAC means')
for (i in 1:nrow(normalized.counts)) {
    zero.min = append(zero.min, mean(normalized.counts[i,13:16]))
    fifteen.min = append(fifteen.min, mean(normalized.counts[i,1:4]))
    thirty.min = append(thirty.min, mean(normalized.counts[i,5:8]))
    fourtyfive.min = append(fourtyfive.min, mean(normalized.counts[i,9:12]))
    
    genes = append(genes,rownames(normalized.counts)[i])
}

atac.means = data.frame(row.names = genes, 
                        min.0 = zero.min, 
                        min.15 = fifteen.min,
                        min.30 = thirty.min, 
                        min.45 = fourtyfive.min)

save(atac.means,file='atac.means.Rdata')

atac.means$genes<-rownames(atac.means)
df = merge(df, atac.means, by='genes')

12.0.3 add cluster info

# add cluster information
# down.up: cluster 4 & 6
# down.up.down: cluster 1
# gradual.up: cluster 5
# up.down: cluster 3 & 7
# up.down.up: cluster 2

cluster.df = unique(plot.df.atac[,c(5,15,16)])

df = merge(df,cluster.df,by='genes', all=T)

#rownames(df) = df[,1]
#df = df[,-1]

atac.comprehensive =df
save(atac.comprehensive,file='atac.comprehensive.Rdata')

12.0.4 add Pro-genes and TSS location info

mastercoords<-read.table("mastercoords.bed") #contains genes, strand, TSS info from Pro-seq
#TSS locarion
mastercoords1<-mastercoords
mastercoords1$TSS<-NA
for (i in 1:nrow(mastercoords1)){
  if (mastercoords1[i,6]=="+"){
  mastercoords1[i,7]=mastercoords1[i,2]
  } else{
  mastercoords1[i,7]=mastercoords1[i,3]
  }
}
load("atac.comprehensive.Rdata")
atac.comprehensive1 = atac.comprehensive
atac.comprehensive1$summit = as.numeric(atac.comprehensive1$start)+100
atac.comprehensive1$Progenes=NA
atac.comprehensive1$TSS=NA
atac.comprehensive2=data.frame()
for (i in 1:length(unique(atac.comprehensive1$chr))){
  chro=paste0("chr", i)
  for (j in 1:nrow(mastercoords1[mastercoords1$V1==chro,])){
    
      temp=subset(atac.comprehensive1[atac.comprehensive1$chr==chro,],
                  atac.comprehensive1[atac.comprehensive1$chr==chro,]$summit > (mastercoords1[mastercoords1$V1==chro,][j,7]-10000) & 
                    atac.comprehensive1[atac.comprehensive1$chr==chro,]$summit < (mastercoords1[mastercoords1$V1==chro,][j,7]+10000))
      
      if (nrow(temp)>0) {
      temp$Progenes = mastercoords1[mastercoords1$V1==chro,][j,4]
      temp$TSS = mastercoords1[mastercoords1$V1==chro,][j,7]
        }
      
      atac.comprehensive2=rbind(atac.comprehensive2, temp)
      }
}
atac.comprehensive2$location=abs(as.numeric(atac.comprehensive2$summit)  - as.numeric( atac.comprehensive2$summit)
                                )
save(atac.comprehensive2, file="atac.comprehensive2.Rdata")

13 motif analysis visualization

13.0.1 prepare data

load("fimo.scores.atac.Rdata") # contains increase or decreased peaks
load("fimo.scores.nondyn.atac.Rdata") # contains unchanged peaks
load('fimo.scores.all.other.atac.Rdata') # contains all other peaks
load('plot.df.atac.Rdata') 

# change TF motif name to family name
colnames(fimo.scores.atac)<- c("AP1", "CTCF-like factor", "Jun-related factor", "NFY", "Znf143", "Spi1", "SP1-like factor" )
colnames(fimo.scores.nondyn.atac) <-c("AP1", "CTCF-like factor", "Jun-related factor", "NFY", "Znf143", "Spi1", "SP1-like factor" )
colnames(fimo.scores.all.other.atac) <-c("AP1", "CTCF-like factor", "Jun-related factor", "NFY", "Znf143", "Spi1", "SP1-like factor" )
plot.df = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    temp = plot.df.atac[plot.df.atac$genes %in% rownames(fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]),]
    temp$family = colnames(fimo.scores.atac)[i]
    plot.df = rbind(plot.df,temp)
}

plot.df$genes = as.factor(plot.df$genes)
no.fam = rownames(fimo.scores.atac[which(rowSums(is.na(fimo.scores.atac)) == ncol(fimo.scores.atac)),])
temp = plot.df.atac[plot.df.atac$genes %in% no.fam,]
temp$family = 'No Family'
plot.df = rbind(plot.df,temp)

plot.df$status = 'Increase'
plot.df[plot.df$supercluster == 'Transient Decrease' | plot.df$supercluster == 'Oscillating Decrease',]$status = 'Decrease'
plot.df1=unique(plot.df[,c(5,15:18)])
plot.df2 = data.frame()
for(i in 1:ncol(fimo.scores.nondyn.atac)) {
    
     temp=fimo.scores.nondyn.atac[!is.na(fimo.scores.nondyn.atac[,i]),]
     temp$family=colnames(fimo.scores.nondyn.atac)[i]
     temp$genes=rownames(temp)
     plot.df2 = rbind(plot.df2,temp[,8:9])
}

plot.df2$genes = as.factor(plot.df2$genes)
no.fam = rownames(fimo.scores.nondyn.atac[which(rowSums(is.na(fimo.scores.nondyn.atac)) == ncol(fimo.scores.nondyn.atac)),])
temp = data.frame(no.fam)
colnames(temp)="genes"
temp$family = 'No Family'

plot.df2 = rbind(plot.df2,temp)

plot.df2$status = 'unchanged'
plot.df2$cluster = 'unchanged'
plot.df2$supercluster = 'unchanged'
plot.df.sum=rbind(plot.df1, plot.df2)

13.0.2 count the number/fraction of up, down, unchanged ATAC peaks with a genomic instance of the PSWM

raw counts

increased = table(plot.df.sum[plot.df.sum$status == 'Increase',]$family) 
decreased = table(plot.df.sum[plot.df.sum$status == 'Decrease',]$family) 
unchanged = table(plot.df.sum[plot.df.sum$status == 'unchanged',]$family) 

df = data.frame(names = c(names(increased),names(decreased), names(unchanged)),
                num = c(as.vector(increased),as.vector(decreased), as.vector(unchanged)),
                cond = c(rep('Increase',length(increased)),rep('Decrease',length(decreased)),rep('unchanged',length(unchanged))),
                index = rep(as.vector(table(plot.df$family)),3))
df[df$names == 'No Family',]$index = min(df$index) - 1
  1. plot number of peaks in each motif family
library(ggplot2)
pdf(file='number of peaks in each family motif.pdf',width=12,height=7)
print(
        ggplot(df,aes(x = reorder(names,-index),y = num,fill=cond)) +
        geom_bar(stat='identity',position='stack',color='black') +
        #geom_text(aes(label=num),position=position_stack(vjust = 0.5),size=6) +
        labs(title = 'Number of Peaks For Each Motif Family', y = 'Number of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank(),
              axis.text.x = element_text(angle=45,size=20,hjust=.99,vjust=1,color='black',face='bold'),
              axis.text.y = element_text(size=20,face='bold',color='black'),
              axis.title.y = element_text(size=18,face='bold'),
              legend.text = element_text(size=18,face='bold'),
              plot.title = element_text(size=22,face='bold',hjust=0.5)) +
        scale_fill_manual(values = c("blue","red","grey"))
)
dev.off()
#library(ggplot2)
pdf(file='number of dynamic peaks in each family motif.pdf',width=12,height=7)
print(
        ggplot(df[1:16,],aes(x = reorder(names,-index),y = num,fill=cond)) +
        geom_bar(stat='identity',position='stack',color='black') +
        #geom_text(aes(label=num),position=position_stack(vjust = 0.5),size=6) +
        labs(title = 'Number of Peaks For Each Motif Family', y = 'Number of Dynamic Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank(),
              axis.text.x = element_text(angle=45,size=20,hjust=.99,vjust=1,color='black',face='bold'),
              axis.text.y = element_text(size=20,face='bold',color='black'),
              axis.title.y = element_text(size=18,face='bold'),
              legend.text = element_text(size=18,face='bold'),
              plot.title = element_text(size=22,face='bold',hjust=0.5)) +
        scale_fill_manual(values = c("blue","red","grey"))
)
dev.off()

fraction

df_sum<-df
df_sum$fraction=c(df_sum[df_sum$cond=="Increase",]$num/length(unique(plot.df.sum[plot.df.sum$status=="Increase",]$genes)), df_sum[df_sum$cond=="Decrease",]$num/length(unique(plot.df.sum[plot.df.sum$status=="Decrease",]$genes)), df_sum[df_sum$cond=="unchanged",]$num/length(unique(plot.df.sum[plot.df.sum$status=="unchanged",]$genes)))
df_sum$fraction<-(df_sum$fraction)*100
  1. plot percent of peaks in each motif family

this is showing the fraction of each motif in increased peaks/decreased peaks/unchanged peaks.

pdf(file="percent of each motif family occurrance.pdf",width=8,height=5)
    print(
        ggplot(df_sum,aes(x = reorder(names,-num),y = fraction,fill=cond)) +
        geom_bar(stat='identity',position='dodge',color='black') + 
        labs(title = "% of Peaks For Each Motif Family",
             y = "% of Peaks",
             x = 'Motif Family',
             fill = 'Direction') +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank(),
              axis.text.x = element_text(angle=45,size=20,hjust=.99,vjust=1,color='black',face='bold'),
              axis.text.y = element_text(size=20,face='bold',color='black'),
              axis.title.y = element_text(size=18,face='bold'),
              legend.text = element_text(size=18,face='bold'),
              plot.title = element_text(size=22,face='bold',hjust=0.5)) +
        scale_fill_manual(values = c("red","blue","gray"))
    )
dev.off() 

motif_occurance

13.0.3 ChiSq test

fimo.scores.atac1<-fimo.scores.atac
fimo.scores.atac1$genes<-rownames(fimo.scores.atac1)

#increased peaks
Increased_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$status=="Increase",]$genes)) 
colnames(Increased_peaks)<-"genes"
Increased_peaks<-merge(fimo.scores.atac1, Increased_peaks, by="genes")

#decreased peaks
Decreased_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$status=="Decrease",]$genes)) 
colnames(Decreased_peaks)<-"genes"
Decreased_peaks<-merge(fimo.scores.atac1, Decreased_peaks, by="genes")

#unchanged peaks
unchanged_peaks<-fimo.scores.nondyn.atac
unchanged_peaks$genes<-rownames(fimo.scores.nondyn.atac)
#check total number of increase, decrease, unchanged, all other ATAC peaks
family=c("Increase","Decrease", "unchanged", "all.other")
peaks=c(nrow(Increased_peaks), nrow(Decreased_peaks), nrow(unchanged_peaks), nrow(fimo.scores.all.other.atac))
y2=data.frame(family, peaks)
# library(dplyr)
y2 <- y2 %>% 
  mutate( class= "ATAC peaks")
pdf('number of peaks2.pdf', width=5,height=4)

print(ggplot(y2, aes(x=class, y=peaks, fill=factor(family, levels=c("Increase", "Decrease", "unchanged", "all.other")))) +  geom_col(width = .3) +
        geom_text(aes(label = peaks),
            position = position_stack(vjust = 0.5)) +
  labs(title = 'Number of ATAC Peaks', y = 'Number of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("coral","bisque","beige", "azure2"))+
    theme(text = element_text(size = 20)) 
)
dev.off()

motif_occurance

df_sum_withfam <-matrix(ncol=3, nrow=7)
for (i in 1:7){
  df_sum_withfam[i,1]<- as.numeric(table(is.na(unchanged_peaks[[i]]))[[1]])
  df_sum_withfam[i,2]<- as.numeric(table(is.na(Increased_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,3]<- as.numeric(table(is.na(Decreased_peaks[[i+1]]))[[1]])
}
df_sum_withfam<-as.data.frame(df_sum_withfam)
rownames(df_sum_withfam)<-c(colnames(unchanged_peaks)[1:7])
colnames(df_sum_withfam)<-c("unchanged","Increase", "Decrease")

#count peaks with no family
df_sum_nofam <-matrix(ncol=3, nrow=7)
for (j in 1:7){
  df_sum_nofam[j,1]<-as.numeric(table(is.na(unchanged_peaks[[j]]))[[2]])
  df_sum_nofam[j,2] <-as.numeric(table(is.na(Increased_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,3] <-as.numeric(table(is.na(Decreased_peaks[[j+1]]))[[2]])
}

df_sum_nofam<-as.data.frame(df_sum_nofam)
rownames(df_sum_nofam)<-c("no AP1", "no CTCF-like factor", "no Jun-related factor", "no NFY",  "no Znf143", "no Spi1", "no SP1-like factor")
colnames(df_sum_nofam)<-c("unchanged","Increase", "Decrease")

df_sum2<-rbind(df_sum_nofam, df_sum_withfam)

# write.csv(df_sum2, file='peak number of motif.csv')
chidata = data.frame(matrix(ncol = 5, nrow = 7))
for (i in 1:7){
  chidata$fam_vs_no_spe_fam[i]<-as.character(rownames(df_sum2)[i+7])
  #chidata$Gain_X_squared[i]<-chisq.test(df_sum2[c(i,i+9),1:2], correct=F)$statistic
  chidata$Inc_p_value[i]<-chisq.test(df_sum2[c(i,i+7),1:2], correct=F)$p.value
  #chidata$lose_X_squared[i]<-chisq.test(df_sum2[c(i,i+9),1:3], correct=F)$statistic
  chidata$dec_p_value[i]<-chisq.test(df_sum2[c(i,i+7),1:3], correct=F)$p.value
  #chidata$lose_Gain_X_squared[i]<-chisq.test(df_sum2[c(i,i+9),2:3], correct=F)$statistic
  chidata$Inc_Dec_p_value[i]<-chisq.test(df_sum2[c(i,i+7),2:3], correct=F)$p.value
  
}
chidata<-chidata[order(chidata$Inc_Dec_p_value),][,6:9]

chisq_test

13.0.4 bar plot

df_sum2_fraction<-df_sum2
df_sum2_fraction$Increase <-100*df_sum2_fraction$Increase/nrow(Increased_peaks)
df_sum2_fraction$unchanged <-100*df_sum2_fraction$unchanged/nrow(unchanged_peaks)
df_sum2_fraction$Decrease <-100*df_sum2_fraction$Decrease/nrow(Decreased_peaks)

library("reshape")
df_sum2_fraction_mut<-df_sum2_fraction
df_sum2_fraction_mut$family<-rownames(df_sum2_fraction)
df_sum2_fraction_mut<-melt(df_sum2_fraction_mut, 
                           id.vars="family",
                           variable.name="class",
                           value.name="fraction")

df_sum2_fraction_mut$variable <- factor(df_sum2_fraction_mut$variable, levels = c('Increase', 'unchanged', 'Decrease'))
# AP1
num<-c(1,8,15,22,29,36)
x<-df_sum2_fraction_mut[num,]
pdf('percent peaks with or without AP1.pdf', width=8,height=5)
print(ggplot(x, aes(x = factor(variable, level=c("Increase", "unchanged", "Decrease")), y=value, fill=factor(family, levels=c("no AP1", "AP1" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without AP1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("white", "darkgrey"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 25)) 
)
dev.off()

AP1

library(ggplot2)
num <-c(1,8,15,22,29,36)

temp <- df_sum2_fraction_mut[num,]
plot_AP1 <- ggplot(temp, aes(x = factor(variable, level=c("Increase", "unchanged", "Decrease")), y=value, fill=factor(family, levels=c("no AP1", "AP1" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without AP1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("white", "darkgrey"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+2,]
plot_Jun <- ggplot(temp, aes(x = factor(variable, level=c("Increase", "unchanged", "Decrease")), y=value, fill=factor(family, levels=c("no Jun-related factor", "Jun-related factor" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without JUN', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("white", "darkgrey"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+6,]
plot_SP1 <- ggplot(temp, aes(x = factor(variable, level=c("Increase", "unchanged", "Decrease")), y=value, fill=factor(family, levels=c("no SP1-like factor", "SP1-like factor" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without SP1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("white", "darkgrey"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+3,]
plot_NFY <- ggplot(temp, aes(x = factor(variable, level=c("Increase", "unchanged", "Decrease")), y=value, fill=factor(family, levels=c("no NFY", "NFY" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without NFY', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("white", "darkgrey"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+1,]
plot_CTCF <- ggplot(temp, aes(x = factor(variable, level=c("Increase", "unchanged", "Decrease")), y=value, fill=factor(family, levels=c("no CTCF-like factor", "CTCF-like factor" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without CTCF', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("white", "darkgrey"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+4,]
plot_Znf143 <- ggplot(temp, aes(x = factor(variable, level=c("Increase", "unchanged", "Decrease")), y=value, fill=factor(family, levels=c("no Znf143", "Znf143" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without Znf143', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("white", "darkgrey"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+5,]
plot_Spi1 <- ggplot(temp, aes(x = factor(variable, level=c("Increase", "unchanged", "Decrease")), y=value, fill=factor(family, levels=c("no Spi1", "Spi1" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without Spi1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("white", "darkgrey"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 
#library(ggplot2)
library(patchwork)
      
pdf(file='Proportions of peaks with or without each family motif.pdf', width=12,height=10)

(plot_AP1 | plot_Jun)/
  (plot_SP1 | plot_NFY)

  ( plot_CTCF | plot_Znf143)/ 
plot_Spi1

dev.off()

peak_proportion

13.0.5 bar plot by cluster

fimo.scores.atac1<-fimo.scores.atac
fimo.scores.atac1$genes<-rownames(fimo.scores.atac1)

# down.up: cluster 4 & 6
down.up_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="Transient Decrease",]$genes)) 
colnames(down.up_peaks)<-"genes"
down.up_peaks<-merge(fimo.scores.atac1, down.up_peaks, by="genes")

# down.up.down: cluster 1
down.up.down_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="Oscillating Decrease",]$genes)) 
colnames(down.up.down_peaks)<-"genes"
down.up.down_peaks<-merge(fimo.scores.atac1, down.up.down_peaks, by="genes")

# gradual.up: cluster 5
gradual.up_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="Gradual Increase",]$genes)) 
colnames(gradual.up_peaks)<-"genes"
gradual.up_peaks<-merge(fimo.scores.atac1, gradual.up_peaks, by="genes")

# up.down: cluster 3 & 7
up.down_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="Transient Increase",]$genes)) 
colnames(up.down_peaks)<-"genes"
up.down_peaks<-merge(fimo.scores.atac1, up.down_peaks, by="genes")

# up.down.up: cluster 2
up.down.up_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="Oscillating Increase",]$genes)) 
colnames(up.down.up_peaks)<-"genes"
up.down.up_peaks<-merge(fimo.scores.atac1, up.down.up_peaks, by="genes")

#unchanged peaks
unchanged_peaks<-fimo.scores.nondyn.atac
unchanged_peaks$genes<-rownames(fimo.scores.nondyn.atac)
df_sum_withfam <-matrix(ncol=6, nrow=7)
for (i in 1:7){
  df_sum_withfam[i,1]<- as.numeric(table(is.na(unchanged_peaks[[i]]))[[1]])
  df_sum_withfam[i,2]<- as.numeric(table(is.na(down.up_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,3]<- as.numeric(table(is.na(down.up.down_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,4]<- as.numeric(table(is.na(gradual.up_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,5]<- as.numeric(table(is.na(up.down_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,6]<- as.numeric(table(is.na(up.down.up_peaks[[i+1]]))[[1]])
 
}
df_sum_withfam<-as.data.frame(df_sum_withfam)
rownames(df_sum_withfam)<-c(colnames(unchanged_peaks)[1:7])
colnames(df_sum_withfam)<-c("unchanged","Transient Decrease", "Oscillating Decrease","Gradual Increase","Transient Increase","Oscillating Increase")

#no CRE
df_sum_nofam <-matrix(ncol=6, nrow=7)
for (j in 1:7){
  df_sum_nofam[j,1]<- as.numeric(table(is.na(unchanged_peaks[[j]]))[[2]])
  df_sum_nofam[j,2]<- as.numeric(table(is.na(down.up_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,3]<- as.numeric(table(is.na(down.up.down_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,4]<- as.numeric(table(is.na(gradual.up_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,5]<- as.numeric(table(is.na(up.down_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,6]<- as.numeric(table(is.na(up.down.up_peaks[[j+1]]))[[2]])
}

df_sum_nofam<-as.data.frame(df_sum_nofam)
rownames(df_sum_nofam)<-c("no AP1", "no CTCF-like factor", "no Jun-related factor", "no NFY",  "no Znf143", "no Spi1", "no SP1-like factor")
colnames(df_sum_nofam)<-c("unchanged","Transient Decrease", "Oscillating Decrease","Gradual Increase","Transient Increase","Oscillating Increase")
df_sum2<-rbind(df_sum_nofam, df_sum_withfam)

df_sum2_fraction<-df_sum2
df_sum2_fraction$unchanged<-100*df_sum2_fraction$unchanged/nrow(unchanged_peaks)
df_sum2_fraction$`Transient Decrease`<-100*df_sum2_fraction$`Transient Decrease`/nrow(down.up_peaks)
df_sum2_fraction$`Oscillating Decrease`<-100*df_sum2_fraction$`Oscillating Decrease`/nrow(down.up.down_peaks)
df_sum2_fraction$`Gradual Increase`<-100*df_sum2_fraction$`Gradual Increase`/nrow(gradual.up_peaks)
df_sum2_fraction$`Transient Increase`<-100*df_sum2_fraction$`Transient Increase`/nrow(up.down_peaks)
df_sum2_fraction$`Oscillating Increase`<-100*df_sum2_fraction$`Oscillating Increase`/nrow(up.down.up_peaks)


library("reshape")
df_sum2_fraction_mut<-df_sum2_fraction
df_sum2_fraction_mut$family<-rownames(df_sum2_fraction)
df_sum2_fraction_mut<-melt(df_sum2_fraction_mut, 
                           id.vars="family",
                           variable.name="class",
                           value.name="fraction")

df_sum2_fraction_mut$variable <- factor(df_sum2_fraction_mut$variable, levels = c("Gradual Increase","Transient Increase","Oscillating Increase","unchanged", "Transient Decrease", "Oscillating Decrease"))
library(ggplot2)
num <-c(1,8,15,22,29,36,43,50,57,64,71,78)

temp <- df_sum2_fraction_mut[num,]
plot_AP1 <- ggplot(temp, aes(x = factor(variable, level = c("Gradual Increase","Transient Increase","Oscillating Increase","unchanged", "Transient Decrease", "Oscillating Decrease")), y=value, fill=factor(family, levels=c("no AP1", "AP1" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without AP1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("beige", "chartreuse"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+2,]
plot_Jun <- ggplot(temp, aes(x = factor(variable, level = c("Gradual Increase","Transient Increase","Oscillating Increase","unchanged", "Transient Decrease", "Oscillating Decrease")), y=value, fill=factor(family, levels=c("no Jun-related factor", "Jun-related factor" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without JUN', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("beige", "chartreuse"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+6,]
plot_SP1 <- ggplot(temp, aes(x = factor(variable, level = c("Gradual Increase","Transient Increase","Oscillating Increase","unchanged", "Transient Decrease", "Oscillating Decrease")), y=value, fill=factor(family, levels=c("no SP1-like factor", "SP1-like factor" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without SP1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("beige", "chartreuse"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+3,]
plot_NFY <- ggplot(temp, aes(x = factor(variable, level = c("Gradual Increase","Transient Increase","Oscillating Increase","unchanged", "Transient Decrease", "Oscillating Decrease")), y=value, fill=factor(family, levels=c("no NFY", "NFY" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without NFY', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("beige", "chartreuse"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+1,]
plot_CTCF <- ggplot(temp, aes(x = factor(variable, level = c("Gradual Increase","Transient Increase","Oscillating Increase","unchanged", "Transient Decrease", "Oscillating Decrease")), y=value, fill=factor(family, levels=c("no CTCF-like factor", "CTCF-like factor" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without CTCF', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("beige", "chartreuse"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+4,]
plot_Znf143 <- ggplot(temp, aes(x = factor(variable,level = c("Gradual Increase","Transient Increase","Oscillating Increase","unchanged", "Transient Decrease", "Oscillating Decrease")), y=value, fill=factor(family, levels=c("no Znf143", "Znf143" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without Znf143', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("beige", "chartreuse"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 

temp <- df_sum2_fraction_mut[num+5,]
plot_Spi1 <- ggplot(temp, aes(x = factor(variable, level = c("Gradual Increase","Transient Increase","Oscillating Increase","unchanged", "Transient Decrease", "Oscillating Decrease")), y=value, fill=factor(family, levels=c("no Spi1", "Spi1" )))) +  geom_bar(stat='identity',position='stack',color='black') +
  labs(title = '% peaks with/without Spi1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("beige", "chartreuse"))+
  theme(axis.text.x=element_text(angle=45, hjust=0.9))+
    theme(text = element_text(size = 20)) 
#library(ggplot2)
library(patchwork)

pdf(file='Proportions of peaks with or without each family motif by cluster.pdf', width=15,height=10)

(plot_AP1 | plot_Jun)/
  (plot_SP1 | plot_NFY)

  ( plot_CTCF | plot_Znf143)/ 
plot_Spi1

dev.off()

peak_proportion

13.0.6 FIMO scores violin plot box and whisker plots for isolated peaks

plot.df = data.frame()

for(i in 1:ncol(fimo.scores.atac)) {
      scores.temp = fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]
      scores.temp = scores.temp[,-i]
      scores.temp = scores.temp[which(rowSums(is.na(scores.temp)) == ncol(scores.temp)),]
      temp = plot.df.atac[plot.df.atac$genes %in% rownames(scores.temp),]
      temp$family = colnames(fimo.scores.atac)[i]
      plot.df = rbind(plot.df,temp)
}

plot.df$status = 'Increase'
plot.df[plot.df$supercluster == 'Transient Decrease' | plot.df$supercluster == 'Oscillating Decrease',]$status = 'Decrease'

plot.df = unique(plot.df[,16:18])

func <- function(peak) {
    family = plot.df[plot.df$genes == peak,]$family    
    colnum = which(colnames(fimo.scores.atac) == family)
    score = fimo.scores.atac[rownames(fimo.scores.atac) == peak,colnum]
    return(score)
}

plot.df$score = sapply(plot.df$genes,func)
pdf(file = paste0('fimo.scores.isolated.peaks.violin.pdf'),width=12,height=8)

trellis.par.set(box.umbrella = list(lty = 1, col=c("red", "blue"), lwd=2),
                box.rectangle = list( lwd=2.0, col=c("red", "blue"), alpha = 1.0),
                plot.symbol = list(col=c("red", "blue"), lwd=2.0, pch ='.'))

print(
    bwplot(log(as.numeric(as.character(score)), base = 10) ~ status | family, data = plot.df, 
           horizontal = FALSE, pch = '|', do.out = FALSE,
           scales=list(x=list(cex=1.0, relation = "free", rot = 45), y = list(cex=1.0, relation="free")),
           aspect=2.0,
           between=list(y=0.5, x=0.5),
           index.cond=list(c(1,3,5,4,2,7,6)),
           ylab = expression('log'[10]* paste('(motif scores)')),
           xlab = expression('ATAC Peak category'),
                                        #manually setting to avoid outlier, since do.out = FALSE                
           panel = function(..., box.ratio) {
        panel.violin(..., col = "transparent", border = "grey60", varwidth = FALSE, 
            box.ratio = box.ratio)
        panel.bwplot(..., fill = NULL, box.ratio = 0.1)
    },
           #ylim = list(c(0.9, 1.9),  c(0.9, 1.7), c(0.8, 1.8), c(1, 2.15),  c(0.96, 1.4)),
           par.settings = list(superpose.symbol = list(pch = c(16),col=c('grey20'), cex =0.5),
                               #I want to change the background strip to the corresponding motif color
                               strip.background=list(col=c("grey80"))))
    
)
dev.off()

violin_plot

13.0.7 motif density plot

13.1 by unchanged, gain, lose

Generate composites for all activated and repressed peaks

pswm.fimo <- function(fimo.in, out = 'outfilename', nm = 'bHLH_Activated', rc = FALSE) {
    posnum = nchar(fimo.in$sequence[1])
    fimo.in$sequence = toupper(fimo.in$sequence)
    col.matrix = matrix()
    for (g in 1:posnum){
        itnum = lapply(strsplit(as.character(fimo.in$sequence), ''), "[", g)
        if (g == 1) {
            col.matrix = itnum
        } else {
            col.matrix = cbind(col.matrix, itnum)
        }
    }  
  a.nuc = sapply(1:posnum, function(x) sum(col.matrix[,x] == "A"))
  t.nuc = sapply(1:posnum, function(x) sum(col.matrix[,x] == "T"))
  c.nuc = sapply(1:posnum, function(x) sum(col.matrix[,x] == "C"))
  g.nuc = sapply(1:posnum, function(x) sum(col.matrix[,x] == "G"))
  
  pswm = cbind(a.nuc, c.nuc, g.nuc, t.nuc)
  print(pswm)
  outfile = file(paste0(out, '.txt'))
  on.exit(close(outfile))
  writeLines(c("MEME version 4", "ALPHABET= ACGT", "strands: + -", " ", 
               "Background letter frequencies (from uniform background):", 
               "A 0.30000 C 0.20000 G 0.20000 T 0.30000", paste("MOTIF", out), " ",
               paste("letter-probability matrix: alength= 4 w=", posnum)), outfile)
  pswm = pswm/rowSums(pswm)
  if (rc == "TRUE"){
    pswm<- pswm[nrow(pswm):1,ncol(pswm):1]
  } else {}   
    write.table(pswm, file = paste0(out, '.txt'), append = TRUE, 
                quote=FALSE, row.names =FALSE, col.names = FALSE)
    pswm = t(pswm)
    rownames(pswm) = c('A', 'C', 'G', 'T')
    return(pswm)
    #    
#    system(paste0('/Users/guertinlab/meme/libexec/meme-5.1.1/ceqlogo -i ', out, '.txt -m Composite -o ', nm, '.eps'))
#    system(paste0('/Users/guertinlab/meme/libexec/meme-5.1.1/ceqlogo -i ', out, '.txt -m Composite -o ', nm, '.rc.eps -r'))
}
plot.df = plot.df.atac
plot.df$status = 'Increase'
plot.df[plot.df$supercluster == 'Transient Decrease' | plot.df$supercluster == 'Oscillating Decrease',]$status = 'Decrease'
count = -1
vec.names = c()

df.seq = as.data.frame(matrix(ncol=5, nrow=0),stringsAsFactors = FALSE)
colnames(df.seq) = c('chr', 'start', 'end', 'sequence', 'factor')

for(bed.file in Sys.glob(file.path('*_fimo.bed'))) {
    #print(bed.file)
    count = count + 1
    factor.name = strsplit(bed.file, "/")[[1]]
    factor.name = strsplit(factor.name[length(factor.name)],
                           '_fimo.bed')[[1]][1]
    print(factor.name)
    x = read.table(bed.file, stringsAsFactors=FALSE)
    x = x[x[,6] != -1,]
    x = x[,c(1,2,3,10)]
    x$factor = factor.name
    colnames(x) = c('chr', 'start', 'end', 'sequence', 'factor')
    x$re = paste0(x[,1], ':', x[,2], '-', x[,3])
    df.seq = rbind(df.seq, x) 
}
func <- function(peak) {
     return(unique(plot.df[plot.df$genes == peak,]$status))
}

df.seq$status = sapply(df.seq$re,func)

for(i in unique(df.seq$factor)) {
      act = df.seq[df.seq$factor == i & df.seq$status == 'Increase',]
      rep = df.seq[df.seq$factor == i & df.seq$status == 'Decrease',]
      pswm.fimo(act, out = paste0(i,'_increased'))
      pswm.fimo(rep, out = paste0(i,'_decreased'))
}

save(df.seq, file="df.seq.Rdata")

Generating bigWig for motif enrichment plot

on xanadu:

#module load UCSC-tools/3.7.4 gcc/9.2.0 bedtools/2.29.2
#Unable to locate a modulefile for 'ucsc-tools'
module load gcc bedtools
/home/FCAM/ssun/miniconda3/bin/activate #activate miniconda
cd fimo_top_database_PSWM/
for bed in *2M.bed
do
    name=$(echo $bed | awk -F"/" '{print $NF}' | awk -F"_2M.bed" '{print $1}')
    echo $name
    #summing scores of motifs w/in overlapping genomic interval
    cat $bed | mergeBed -i stdin -c 4 -o sum > ${name}_merged_2M.bed
    bedGraphToBigWig ${name}_merged_2M.bed ../../../mm39.chrom.sizes ${name}_mm39_instances.bigWig
    rm ${name}_merged_2M.bed
done
module load R/4.1.2
R
library(lattice)
library(bigWig)

bed.window <- function(bed, half.window) {
  bed[,2] = (bed[,2]+bed[,3])/2-half.window 
  bed[,3] = bed[,2] + 2 * half.window 
  return(bed)
  }

load('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/new_denovo_motif/plot.df.atac.Rdata')
load('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/new_denovo_motif/fimo.scores.atac.Rdata')

plot.df = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    temp = plot.df.atac[plot.df.atac$genes %in% rownames(fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]),]
    temp$family = colnames(fimo.scores.atac)[i]
    plot.df = rbind(plot.df,temp)
}

plot.df$status = 'Increase'
plot.df[plot.df$supercluster == 'Transient Decrease' | plot.df$supercluster == 'Oscillating Decrease',]$status = 'Decrease'
all.fimo = data.frame(matrix(ncol = 4, nrow = 0))
colnames(all.fimo) = c('density', 'tf', 'category', 'range')

half.win = 600
file.suffix = '_mm39_instances.bigWig'
dir = '/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/new_denovo_motif/fimo_top_database_PSWM'
setwd(dir)

decreased = plot.df[plot.df$status == 'Decrease',12:14]
decreased[,2] = as.numeric(decreased[,2])
decreased[,3] = as.numeric(decreased[,3])
decreased = bed.window(decreased,half.win)

increased = plot.df[plot.df$status == 'Increase',12:14]
increased[,2] = as.numeric(increased[,2])
increased[,3] = as.numeric(increased[,3])
increased = bed.window(increased,half.win)

not.different = read.table('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/new_denovo_motif/nondynamic_peaks.bed')
not.different = not.different[not.different$V1 != 'chrM',]
not.different = bed.window(not.different,half.win)

all.fimo = data.frame()

for(i in 1:ncol(fimo.scores.atac)) {
    factor = colnames(fimo.scores.atac)[i]

    mod.bigWig = paste0(factor,file.suffix)
    factor.name = factor
    print(factor.name)

    loaded.bw = load.bigWig(mod.bigWig)
    
    dec.inten = bed.step.probeQuery.bigWig(loaded.bw, decreased,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    dec.query.df = data.frame(cbind(colMeans(dec.inten), factor.name,
                                    'Decrease', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(dec.query.df) = c('density', 'tf', 'category', 'range')
    
    inc.inten = bed.step.probeQuery.bigWig(loaded.bw, increased,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    inc.query.df = data.frame(cbind(colMeans(inc.inten), factor.name,
                                    'Increase', seq(-half.win,(half.win-10), 10)), stringsAsFactors=F)
    colnames(inc.query.df) = c('density', 'tf', 'category', 'range')
    
    ctrl.inten = bed.step.probeQuery.bigWig(loaded.bw, not.different,
                                            gap.value = 0, step = 10, as.matrix = TRUE)
    ctrl.query.df = data.frame(cbind(colMeans(ctrl.inten), factor.name,
                                     'unchanged', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(ctrl.query.df) = c('density', 'tf', 'category', 'range')
    
    tf.all = rbind(dec.query.df, inc.query.df, ctrl.query.df)

    all.fimo = rbind(all.fimo,tf.all)
}

all.fimo[,1] = as.numeric(all.fimo[,1])
all.fimo[,4] = as.numeric(all.fimo[,4])
plot.fimo.lattice <- function(dat, fact = 'Motif', summit = 'Hypersensitivity Summit', class= '',
                              num.m = -200, num.p =90, y.low =0, y.high = 0.2,
                              col.lines = c(rgb(0,0,1,1/2), rgb(1,0,0,1/2),
                                            rgb(0.1,0.5,0.05,1/2), rgb(0,0,0,1/2),
                                            rgb(1/2,0,1/2,1/2), rgb(0,1/2,1/2,1/2), rgb(1/2,1/2,0,1/2)),
                              fill.poly = c(rgb(0,0,1,1/4), rgb(1,0,0,1/4), rgb(0.1,0.5,0.05,1/4),
                                            rgb(0,0,0,1/4), rgb(1/2,0,1/2,1/4))) {
    
    pdf('motif_enrichment_around_summits.pdf')#, width=6.83, height=3.5)
    print(xyplot(density ~ range|tf, groups = category, data = dat, type = 'l',
                 scales=list(x=list(cex=0.8,relation = "free"), 
                 y =list(cex=0.8,axs = 'i',relation = "free")),
                 xlim=c(num.m,num.p),                                       
                 col = col.lines,
                 auto.key = list(points=F, lines=T, cex=0.8, columns = 2),
                 par.settings = list(superpose.symbol = list(pch = c(16), col=col.lines, cex =0.7),
                                     superpose.line = list(col = col.lines, lwd=c(2,2,2,2,2,2),
                                                           lty = c(1,1,1,1,1,1,1,1,1))),
                 cex.axis=1.0,
                 par.strip.text=list(cex=0.9, font=1, col='black',font=2),
                 aspect=1.0,
                 between=list(y=0.5, x=0.5),
                 index.cond = list(c(1,3,7,4,2,5,6)),
                 lwd=2,
                 ylab = list(label = "Weighted Motif Density", cex =1,font=2),
                 xlab = list(label = 'Distance from ATAC-seq Peak Summit', cex =1,font=2),
                 strip = function(..., which.panel, bg) {
                     bg.col = 'grey' #c("blue","grey65","red")
                     strip.default(..., which.panel = which.panel, 
                                    bg = rep(bg.col, length = which.panel)[which.panel])
                 }
                 ))
    dev.off()
}

plot.fimo.lattice(all.fimo, num.m = -500, num.p = 500,
                  col.lines = c('blue','red','grey65'))

## by supercluster module load R/4.1.2

library(lattice)
library(bigWig)

bed.window <- function(bed, half.window) {
  bed[,2] = (bed[,2]+bed[,3])/2-half.window 
  bed[,3] = bed[,2] + 2 * half.window 
  return(bed)
  }

load('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/new_denovo_motif/plot.df.atac.Rdata')
load('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/new_denovo_motif/fimo.scores.atac.Rdata')

plot.df = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    temp = plot.df.atac[plot.df.atac$genes %in% rownames(fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]),]
    temp$family = colnames(fimo.scores.atac)[i]
    plot.df = rbind(plot.df,temp)
}

plot.df$status = 'Increase'
plot.df[plot.df$supercluster == 'Transient Decrease' | plot.df$supercluster == 'Oscillating Decrease',]$status = 'Decrease'
all.fimo2 = data.frame(matrix(ncol = 4, nrow = 0))
colnames(all.fimo2) = c('density', 'tf', 'category', 'range')

half.win = 600
file.suffix = '_mm39_instances.bigWig'

gradual.up = plot.df[plot.df$supercluster == 'Gradual Increase',12:14]
gradual.up[,2] = as.numeric(gradual.up[,2])
gradual.up[,3] = as.numeric(gradual.up[,3])
gradual.up = bed.window(gradual.up,half.win)

up.down = plot.df[plot.df$supercluster == 'Transient Increase',12:14]
up.down[,2] = as.numeric(up.down[,2])
up.down[,3] = as.numeric(up.down[,3])
up.down = bed.window(up.down,half.win)

up.down.up = plot.df[plot.df$supercluster == 'Oscillating Increase',12:14]
up.down.up[,2] = as.numeric(up.down.up[,2])
up.down.up[,3] = as.numeric(up.down.up[,3])
up.down.up = bed.window(up.down.up,half.win)

down.up = plot.df[plot.df$supercluster == 'Transient Decrease',12:14]
down.up[,2] = as.numeric(down.up[,2])
down.up[,3] = as.numeric(down.up[,3])
down.up = bed.window(down.up,half.win)

down.up.down = plot.df[plot.df$supercluster == 'Oscillating Decrease',12:14]
down.up.down[,2] = as.numeric(down.up.down[,2])
down.up.down[,3] = as.numeric(down.up.down[,3])
down.up.down = bed.window(down.up.down,half.win)


not.different = read.table('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/new_denovo_motif/nondynamic_peaks.bed')
not.different = not.different[not.different$V1 != 'chrM',]
not.different = bed.window(not.different,half.win)


all.fimo2 = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    factor = colnames(fimo.scores.atac)[i]

    mod.bigWig = paste0(factor,file.suffix)
    factor.name = factor
    print(factor.name)

    loaded.bw = load.bigWig(mod.bigWig)
    
    
    gradual.up.inten = bed.step.probeQuery.bigWig(loaded.bw, gradual.up,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    gradual.up.query.df = data.frame(cbind(colMeans(gradual.up.inten), factor.name,
                                    'Gradual Increase', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(gradual.up.query.df) = c('density', 'tf', 'category', 'range')
    
    
    up.down.inten = bed.step.probeQuery.bigWig(loaded.bw, up.down,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    up.down.query.df = data.frame(cbind(colMeans(up.down.inten), factor.name,
                                    'Transient Increase', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(up.down.query.df) = c('density', 'tf', 'category', 'range')
    
    
    up.down.up.inten = bed.step.probeQuery.bigWig(loaded.bw, up.down.up,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    up.down.up.query.df = data.frame(cbind(colMeans(up.down.up.inten), factor.name,
                                    'Oscillating Increase', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(up.down.up.query.df) = c('density', 'tf', 'category', 'range')
    
    
    down.up.inten = bed.step.probeQuery.bigWig(loaded.bw, down.up,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    down.up.query.df = data.frame(cbind(colMeans(down.up.inten), factor.name,
                                    'Transient Decrease', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(down.up.query.df) = c('density', 'tf', 'category', 'range')
    
    
    down.up.down.inten = bed.step.probeQuery.bigWig(loaded.bw, down.up.down,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    down.up.down.query.df = data.frame(cbind(colMeans(down.up.down.inten), factor.name,
                                    'Oscillating Decrease', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(down.up.down.query.df) = c('density', 'tf', 'category', 'range')
    
    
    
    ctrl.inten = bed.step.probeQuery.bigWig(loaded.bw, not.different,
                                            gap.value = 0, step = 10, as.matrix = TRUE)
    ctrl.query.df = data.frame(cbind(colMeans(ctrl.inten), factor.name,
                                     'unchanged', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(ctrl.query.df) = c('density', 'tf', 'category', 'range')
    
    
    
    
    tf.all = rbind(gradual.up.query.df, up.down.query.df, up.down.up.query.df, down.up.query.df, down.up.down.query.df, ctrl.query.df)

    all.fimo2 = rbind(all.fimo2, tf.all)
}

all.fimo2[,1] = as.numeric(all.fimo2[,1])
all.fimo2[,4] = as.numeric(all.fimo2[,4])


save(all.fimo2, file="all.fimo2.Rdata")
plot.fimo.lattice <- function(dat, fact = 'Motif', summit = 'Hypersensitivity Summit', class= '',
                              num.m = -200, num.p =90, y.low =0, y.high = 0.2,
                              col.lines = c(rgb(0,0,1,1/2), rgb(1,0,0,1/2),
                                            rgb(0.1,0.5,0.05,1/2), rgb(0,0,0,1/2),
                                            rgb(1/2,0,1/2,1/2), rgb(0,1/2,1/2,1/2), rgb(1/2,1/2,0,1/2)),
                              fill.poly = c(rgb(0,0,1,1/4), rgb(1,0,0,1/4), rgb(0.1,0.5,0.05,1/4),
                                            rgb(0,0,0,1/4), rgb(1/2,0,1/2,1/4))) {
    
    pdf('motif_enrichment_around_summits_by_cluster.pdf')#, width=6.83, height=3.5)
    print(xyplot(density ~ range|tf, groups = category, data = dat, type = 'l',
                 scales=list(x=list(cex=0.8,relation = "free"), 
                 y =list(cex=0.8,axs = 'i',relation = "free")),
                 xlim=c(num.m,num.p),                                       
                 col = col.lines,
                 auto.key = list(points=F, lines=T, cex=0.8, columns = 2),
                 par.settings = list(superpose.symbol = list(pch = c(16), col=col.lines, cex =0.7),
                                     superpose.line = list(col = col.lines, lwd=c(2,2,2,2,2,2),
                                                           lty = c(1,1,1,1,1,1,1,1,1))),
                 cex.axis=1.0,
                 par.strip.text=list(cex=0.9, font=1, col='black',font=2),
                 aspect=1.0,
                 between=list(y=0.5, x=0.5),
                 index.cond = list(c(1,3,7,4,2,5,6)),
                 lwd=1,
                 ylab = list(label = "Weighted Motif Density", cex =1,font=2),
                 xlab = list(label = 'Distance from ATAC-seq Peak Summit', cex =1,font=2),
                 strip = function(..., which.panel, bg) {
                     bg.col = 'grey'
                     strip.default(..., which.panel = which.panel, 
                                    bg = rep(bg.col, length = which.panel)[which.panel])
                 }
                 ))
    dev.off()
}

plot.fimo.lattice(all.fimo2, num.m = -500, num.p = 500,
                  col.lines = c('lightsalmon','blue','red','cornflowerblue','tomato3','grey65'))

motif_density_bycluster

13.1.1 Q: is a motif specific to clusters/group of clusters that share kinetic profiles?

AP1, JUN, SP1, NFY (based on chi test and bar plot);

13.1.2 Plot traces of isolated peaks

only for the four candidates motif: AP1, JUN, SP1, NFY

plot.df = data.frame()

fimo.scores.atac.subset<-fimo.scores.atac[, c(1,3,7,4)]

for(i in 1:ncol(fimo.scores.atac.subset)) {
      scores.temp = fimo.scores.atac.subset[!is.na(fimo.scores.atac.subset[,i]),]
      scores.temp = scores.temp[,-i]
      scores.temp = scores.temp[which(rowSums(is.na(scores.temp)) == ncol(scores.temp)),]
      temp = plot.df.atac[plot.df.atac$genes %in% rownames(scores.temp),]
      temp$family = colnames(fimo.scores.atac.subset)[i]
      plot.df = rbind(plot.df,temp)
}

plot.df$genes = as.factor(plot.df$genes)
cat.colours = plot.df[plot.df$merge == 'one_groupt0',]
cat.colours$genes <- as.factor(cat.colours$genes)
cat.colours$supercluster <- as.factor(cat.colours$supercluster)

cat.colours$colour[cat.colours$supercluster == 'Oscillating Increase'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'Transient Increase'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'Gradual Increase'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'Oscillating Decrease'] <- '#0000FF08'
cat.colours$colour[cat.colours$supercluster == 'Transient Decrease'] <- '#0000FF08'

cat.colours$colour <- as.factor(cat.colours$colour)

cat.colours <- cat.colours[match(levels(plot.df$genes), cat.colours$genes), ]
pdf(file = 'four_isolated_families_dynamic_accessibility.pdf',width=14,height=14)

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))

print(
    xyplot(value ~  sample.conditions | family, group = genes, data = plot.df, type = c('l'), #type = c('l','p'),
       scales=list(x=list(cex=1.5,relation = "free", rot = 45), y =list(cex=1.5, relation="free")),
       aspect=1.0,
       layout = c(4,1),
       between=list(y=0.5, x=0.5),
       
       main = list(label = 'Four Isolated Family Motifs in Dynamic Peaks', cex = 2.0),
       ylab = list(label = 'Normalized ATAC signal', cex =2.0),
       xlab = list(label = 'Time (minutes)', cex =2.0),
       par.settings = list(superpose.symbol = list(pch = c(16),col=c('grey20'), cex = 2.0),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = as.character(cat.colours$colour), lwd=c(1),lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           #panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 15, do.out = FALSE)
           #panel.spline(x, y, col ='blue', lwd =2.0, ...) 
           
       })

)
dev.off()

traces

# calculate trace number for AP1, JUN, SP1, NFY 
plot.df$status = 'Increase'
plot.df[plot.df$supercluster == 'Transient Decrease' | plot.df$supercluster == 'Oscillating Decrease',]$status = 'Decrease'

plot.df.AP1<-plot.df[plot.df$family=="AP1",]
plot.df.JUN<-plot.df[plot.df$family=="Jun-related factor",]
plot.df.SP1<-plot.df[plot.df$family=="SP1-like factor",]
plot.df.NFY<-plot.df[plot.df$family=="NFY",]

length(unique(plot.df.AP1[plot.df.AP1$status=="Increase",]$genes))
length(unique(plot.df.AP1[plot.df.AP1$status=="Decrease",]$genes))

length(unique(plot.df.JUN[plot.df.JUN$status=="Increase",]$genes))
length(unique(plot.df.JUN[plot.df.JUN$status=="Decrease",]$genes))

length(unique(plot.df.SP1[plot.df.SP1$status=="Increase",]$genes))
length(unique(plot.df.SP1[plot.df.SP1$status=="Decrease",]$genes))

length(unique(plot.df.NFY[plot.df.NFY$status=="Increase",]$genes))
length(unique(plot.df.NFY[plot.df.NFY$status=="Decrease",]$genes))

traces